Abstract


Principal stratification is a widely used framework for addressing post-randomization com- 
plications. After using principal stratification to define causal effects of interest, researchers 
are increasingly turning to finite mixture models to estimate these quantities. Unfortunately, 
standard estimators of mixture parameters, like the MLE, are known to exhibit pathological 
behavior. We study this behavior in a simple but fundamental example, a two-component 
Gaussian mixture model in which only the component means and variances are unknown, and 
focus on the setting in which the components are weakly separated. In this case, we show that 
the asymptotic convergence rate of the MLE is quite poor, such as $O(n^{-1/6})$ or even $O(n^{-1/8})$.
We then demonstrate via theoretical arguments as well as extensive simulations that, in finite 
samples, the MLE behaves like a threshold estimator, in the sense that the MLE can give strong 
evidence that the means are equal when the truth is otherwise. We also explore the behavior of 
the MLE when the MLE is non-zero, showing that it is difficult to estimate both the sign and 
magnitude of the means in this case. We provide diagnostics for all of these pathologies and 
apply these ideas to re-analyzing two randomized evaluations of job training programs, JOBS 
II and Job Corps. Our results suggest that the corresponding maximum likelihood estimates 
should be interpreted with caution in these cases. 
1 Introduction


Finite mixture models are notorious for giving pathological results (Redner and Walker, 1984); 
indeed, Larry Wasserman has called finite mixtures the “Twilight Zone of Statistics” (Wasserman, 
2012). Our motivation for this paper is to understand how the pathological features of weakly 
separated finite mixture models affect inference for component means, especially with respect to 
estimating causal effects in the principal stratification framework, an important example of such 
inference. 

Principal stratification is a widely used approach for addressing post-randomization complica- 
tions, including noncompliance with treatment assignment (Frangakis and Rubin, 2002). Typically, 
the goal is to estimate causal effects within partially latent subgroups known as principal strata. 
While there are many possible ways to estimate these principal causal effects, the most common 
approach is via finite mixture models, treating the unknown principal strata as mixture compo- 
nents (Imbens and Rubin, 1997). To date, scores of applied and methodological papers have relied 
on finite mixtures to estimate causal effects, both explicitly and implicitly. 

To present our main results, we construct a simple two-parameter model that captures the
essential features of the problem: maximum likelihood estimation for the component means and
variances in a two-component location-scale mixture of Gaussian distributions,

$$Y_i \overset{iid}{\sim} \pi\,\mathcal{N}(\mu_0, \sigma_0^2) + (1-\pi)\,\mathcal{N}(\mu_1, \sigma_1^2), \qquad (1.1)$$

where the mixing proportion, $\pi \in (0,1)$, is assumed to be known.

While the two-component finite mixture model in (1.1) is a toy example in some settings, it is a 
fundamental structure in many causal inference problems. For instance, in the canonical example 
of noncompliance in a randomized trial (Angrist et al., 1996), individuals randomly assigned to 
the treatment group who actually receive the treatment are a mixture of Compliers and Always 
Takers. Assuming that individual outcomes follow a Normal distribution yields the mixture model 
in (1.1). Thus, understanding the difficulties of component-specific inference is vital to estimating
parametric principal stratification models. 

The asymptotic properties of the MLE for the component means in Equation (1.1) are well 
established in two settings. First, when the difference in means, $\Delta = \mu_1 - \mu_0$, is fixed, the MLE
has strong asymptotic guarantees, including consistency and parametric convergence (Everitt and
Hand, 1981; Chen, 2017). Second, when the mixture is degenerate, i.e., $\Delta = 0$, the MLE has at
most $O(n^{-1/4})$ convergence (Chen, 1995; Heinrich and Kahn, 2018). This is closely related to the
problem of testing the number of components in a finite mixture (McLachlan and Peel, 2004). 

In this paper, we focus on the behavior of the MLE when $\Delta$ is small but not zero. This
“intermediate sample size regime” is an important case in practice and is especially relevant for
principal stratification models. To set the stage, Figure 1 shows the distribution of the MLE of $\Delta$
for 1000 synthetic data sets generated from Equation (1.1) for two settings. The sample sizes and
mixing proportions match those in our two key principal stratification examples, JOBS II and Job
Corps. The assumed difference in component means is $\Delta = 0.5$ standard deviations, which is quite
large for many social science applications but smaller than in textbook examples of well-separated
components. In both cases, the distribution of the simulated MLEs is markedly non-Normal. Both
distributions have three notable features. First, there is a large point mass at zero. Second, a
considerable portion of simulated MLEs have the opposite sign from the truth. Finally, simulated
MLEs that are non-zero and have the correct sign are not centered at the true value. To emphasize,
these features are not due to model mis-specification: we estimate the MLE using the true model.

Figure 1: Distribution of $\widehat{\Delta}^{\mathrm{mle}}$ for 1000 fake data sets designed to reflect the JOBS II and JobCorps
studies. Data sets were generated from the two-component homoskedastic Normal mixture model
in Equation (1.1) with $\Delta = 0.5$ and, respectively, (a) $N = 132$ and $\pi = 0.45$ and (b) $N = 3{,}371$
and $\pi = 0.06$.


1.1 Main contributions of our paper


In this paper, we give theoretical explanations for some of the practical difficulties encountered 
in estimation in two-component finite mixture models, as shown in Figure 1, and, based on our
findings, suggest guidance for practice. 

We first, in Section 2, study the asymptotic properties of the MLE of the two-component model
(1.1) in the “intermediate sample size regime” when $\Delta \to 0$ as $n \to \infty$. This framework adequately
captures weakly separated mixture components in relation to the sample size. Even for the basic
model (1.1), not much seems to be known about the convergence rate of the MLE in this regime,
especially when $\sigma_0$ and $\sigma_1$ are unknown. We first establish the convergence rate of the MLE,
resulting in several interesting findings for the model in (1.1) when $\sigma_0 = \sigma_1$. When $\sigma = \sigma_0 = \sigma_1$
is known and $\pi \neq 1/2$, the convergence rate can and does reach $O(n^{-1/6})$ up to logarithmic factors.
This is worse than the rate for the degenerate case where $\Delta = 0$, suggesting that small but
non-zero separations are particularly difficult to estimate well. In such scenarios, our theoretical
results explain the empirically observed difficulties in estimating $\Delta$ shown in Figure 1. For $\pi = 1/2$,
we can only estimate the difference up to a sign due to identifiability issues. In this case, the
convergence rate for estimating the magnitude of the parameter is a faster, yet still slow,
$O(n^{-1/4})$.

When $\sigma$ is not known, the worst-case convergence rate of the MLE remains $O(n^{-1/6})$ for the
$\pi \neq 1/2$ setting but falls to $O(n^{-1/8})$ for the $\pi = 1/2$ setting, an order of magnitude worse than when
$\sigma$ was assumed known. These results are quite novel and delicate to derive, as we have to carefully
account for the interaction between the location and scale parameters. Interestingly, the results
together show that while the convergence is faster for the symmetric case than the asymmetric case
in the known variance regime, it is slower in the unknown variance regime.

After presenting our convergence results, we turn to the practical difficulties in estimating $\Delta$
and formalize the phenomenon of the large point mass at zero shown in Figure 1. We call this
phenomenon pile up. Specifically, we show via a mix of simulations and theoretical arguments that,
in certain intermediate sample size regimes, $\widehat{\Delta}^{\mathrm{mle}} = 0$ with very high probability even though $\Delta \neq 0$.
Thus, the MLE behaves like a threshold estimator analogous to the classic Hodges estimator (see 
Van der Vaart, 2000). We then show that pile up occurs when the overall mixture variance is less 
than the within-component variance. To the best of our knowledge, we are the first to document 
this pile-up phenomenon in finite mixtures. 

Next, we turn to using higher-order mixture moments for diagnosing pathologies with the MLE. 
First, we use these moments to bound the probability of pile up given either the realized data set 
or population parameters. We then discuss the classic problem of choosing the correct mode in 
a bimodal likelihood and argue that it is particularly difficult here. We show that this problem 
corresponds to estimating the sign of $\Delta$ (i.e., the relative ordering of $\mu_0$ and $\mu_1$) and demonstrate
how to use the third moment of the mixture distribution to assess the probability of a sign error.
We combine these results with extensive simulations to show that, across a range of reasonable
settings, the sign of the MLE for $\Delta$ is no better at predicting the true sign than a coin flip.

We finally apply these mixture results to estimating principal stratification models in two ran- 
domized evaluations of job training programs, JOBS II (Vinokur et al., 1995) and JobCorps (Scho- 
chet et al., 2008). These two examples have been the focus of several prominent papers using finite 
mixtures for principal stratification (e.g., Zhang et al., 2009; Mealli and Pacini, 2013; Frumento
et al., 2012) and highlight two main use cases for this framework. For both data sets, we slightly
simplify the problem to isolate the pathologies of the finite mixtures. We then assess the observed
mixture distributions using the diagnostics we propose and find that pathologies are quite likely.
Consequently, we do not have high confidence in the quality of the maximum likelihood estimates
of $\widehat{\Delta}^{\mathrm{mle}} = 0$ for JOBS II and an implausibly large $\widehat{\Delta}^{\mathrm{mle}}$ for JobCorps. Our overall findings suggest
that finite mixture models should be used with caution in settings such as these.

Overall, the implications for parameter estimation in finite mixtures are both novel and impor- 
tant. In particular, there is a longstanding consensus in finite mixture modeling that the MLE can 
behave poorly when components are not well separated (Redner and Walker, 1984). Indeed, several 
experienced researchers have told us that estimating component-specific parameters is “hopeless” 
in the settings we consider. While we agree with this assessment, we argue that there are no clear 
guidelines for researchers in practice. In particular, how do researchers know when components 
are separated “enough” and what happens if they are not? This is especially important because, 
in settings with insufficient information in the data, the MLE gives a very plausible value of zero 
rather than ‘NA.’ We believe that the framework we lay out here is an important next step towards
a deeper understanding of these issues.


Paper plan. Section 2 describes the asymptotic behavior of the MLE under weak separation. 
Section 3 explores the non-asymptotic behavior of the MLE and characterizes pile up. Section 4 uses 
the mixture moments for constructing diagnostics for the MLE. Section 5 gives a brief overview 
of the principal stratification framework and the connection to finite mixture models as well as 
an analysis of JOBS II. Section 6 provides additional discussion on implications for practice and 
possible research directions. Finally, the supplementary materials address several points that go
beyond the main text, including proofs.


Notation. For any two densities $p$ and $q$ (with respect to Lebesgue measure $\mu$), the variational
distance between $p$ and $q$ is given by $V(p,q) = (1/2)\int |p - q|\, d\mu$. Additionally, the squared Hellinger
distance between $p$ and $q$ is given by $h^2(p,q) = (1/2)\int (p^{1/2} - q^{1/2})^2\, d\mu$. Furthermore, the
expression $a_n \gtrsim b_n$ is used to denote $a_n \geq C b_n$ for some $C$ that is independent of $n$.
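As a numerical illustration of the two distances just defined, the following sketch approximates them by a Riemann sum for two unit-variance Normal densities. The grid, the mean gap of 0.5, and all function names are assumptions made for this example only; the closed forms at the end are standard results for equal-variance Normals and serve purely as a sanity check.

```python
import math
import numpy as np

def normal_pdf(x, mean, sd=1.0):
    """Density of a Normal distribution with the given mean and standard deviation."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def variational_distance(p, q, dx):
    """V(p, q) = (1/2) * integral |p - q|, approximated by a Riemann sum."""
    return 0.5 * float(np.sum(np.abs(p - q))) * dx

def squared_hellinger(p, q, dx):
    """h^2(p, q) = (1/2) * integral (sqrt(p) - sqrt(q))^2, approximated by a Riemann sum."""
    return 0.5 * float(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) * dx

# Hypothetical example: two unit-variance Normal densities whose means differ by 0.5.
gap = 0.5
x = np.linspace(-12.0, 12.0, 48001)
dx = x[1] - x[0]
p, q = normal_pdf(x, 0.0), normal_pdf(x, gap)

V = variational_distance(p, q, dx)
h2 = squared_hellinger(p, q, dx)

# Closed forms for equal-variance Normals, used only as a sanity check.
V_closed = math.erf(gap / (2.0 * math.sqrt(2.0)))
h2_closed = 1.0 - math.exp(-gap ** 2 / 8.0)
```

With these normalizations both distances lie in [0, 1], and the squared Hellinger distance is the smaller of the two in this example.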


1.2 Related literature and previous work 


There is a vast literature on inference in finite mixture models, dating back to the seminal work 
of Pearson (1894). For thorough reviews, see Everitt and Hand (1981), Redner and Walker 
(1984), Titterington et al. (1985), McLachlan and Peel (2004), and McLachlan et al. (2019).
Frühwirth-Schnatter (2006) focuses on the Bayesian paradigm; Lindsay (1995) gives an overview of moment
estimators; and Moitra (2014) discusses relevant results from machine learning. We briefly highlight 
several relevant aspects of this literature. 


First, there has been extensive research on the asymptotic behavior of finite mixture models.
Chen (2017) gives a recent, comprehensive review. Much of this literature, however, is about
the problem of testing the order of the finite mixture (see McLachlan and Peel, 2004). There are
several recent papers that instead address estimation. Chen et al. (2014) focus on estimating the
mixing proportion when components are only weakly separated. Ho and Nguyen (2016) give results
for over-specified location-scale Gaussian mixtures. Gadat et al. (2016) study the convergence
rate of $L^2$-norm estimators for a few settings of two-component models. Finally, Anandkumar et al.
(2012), Hardt and Price (2015), and Wu and Yang (2018) explore the asymptotic properties of method
of moments estimators in rather general settings of Gaussian mixtures.

Second, the problem of weak separation is a special case of the weak identification problem 
especially common in econometrics. There are many examples of weak identification in other 
settings, including the weak instruments problem (Staiger and Stock, 1997) and the moving average 
unit root problem, which is the source of the term pile up (Shephard and Harvey, 1990; Andrews 
and Cheng, 2012). See also Chen et al. (2014). 

Finally, although the technical discussion focuses narrowly on finite mixtures, our motivation 
remains the broader question of inference for causal effects within principal strata. To date, only a 
handful of papers have directly addressed the finite sample properties of mixtures for causal infer- 
ence. Griffin et al. (2008) conduct extensive simulations and conclude that principal stratification 
models are generally impractical in social science settings. Mattei et al. (2013) caution that uni- 
variate mixture models often yield poor results and suggest jointly estimating effects for multiple 
outcomes, such as by assuming multivariate Normality. Mercatanti (2013) proposes an approach 
for inference with a multimodal likelihood in the principal stratification setting. Frumento et al. 
(2016) explore methods for quantifying uncertainty in principal stratification problems when the 
likelihood is non-ellipsoidal. See also Chung et al. (2004), Zhang et al. (2008), Richardson et al. 
(2011), and Frumento et al. (2012). 


2 Asymptotic properties of the MLE: Phase transition 


In this section, we study the asymptotic behavior of the MLE under two distinct but representative
settings of model (1.1): first, when the variances $\sigma_0^2$ and $\sigma_1^2$ are assumed known and equal; second,
when the variances $\sigma_0^2$ and $\sigma_1^2$ are unknown but assumed to be equal. Overall, we demonstrate that
worst-case convergence when the components are close together is generally slow.


2.1 Known variances setting 


Motivated by the illustrative simulations in Figure 1, we now explore the properties of the MLE,
$\widehat{\Delta}^{\mathrm{mle}}$, when $\Delta$ is small but non-zero. In the classical asymptotic regime, where $\Delta$ is fixed as in
Equation (1.1), it is immediate that $\widehat{\Delta}^{\mathrm{mle}}$ has a parametric rate of convergence in this simple
example (see Redner and Walker, 1984; Chen, 1995). However, as shown in Figure 1a, this asymptotic
regime can be a poor approximation to reality when components only have moderate separation.
We therefore consider an asymptotic regime in which $\Delta_n$ shrinks as $n$ increases. Our core finding is
that, under this regime in which the two components are only slightly separated and the variance
is known, the convergence rate of the MLE for the difference in means is quite poor.

Under the assumption that variances are known, we re-parametrize Equation (1.1) and assume
that $Y_i$, $i \in \{1,\ldots,n\}$, are i.i.d. samples from the model:

$$Y_i \overset{iid}{\sim} \pi\,\mathcal{N}(\mu - \delta_n, \sigma^2) + (1-\pi)\,\mathcal{N}(\mu + c\delta_n, \sigma^2), \qquad (2.1)$$

where $c := \frac{\pi}{1-\pi}$ and $\delta_n \in \Theta$ is a free parameter that varies with $n$. We assume the equal variance
case of $\sigma_0 = \sigma_1 = \sigma$ for a known $\sigma$. Relative to Equation (1.1), $\mu_0 = \mu - \delta_n$, $\mu_1 = \mu + c\delta_n$,
$\Delta = (1+c)\delta_n$, and $\mu$ is the overall mean, $E Y_i = \mu$. For simplicity, we set $\mu = 0$; all of the results
in this section are applicable for any $\mu \in \mathbb{R}$. When $\mu = 0$, the $\delta_n$ parameter is both the
(negative) location of the first component and a scaling of the separation of components $\Delta$;
it thus corresponds to both a location and a separation parameter. We focus on this separation
parameter $\delta_n$ for ease of mathematical derivations; because $\Delta$ from Equation (1.1) is a constant
re-scaling of $\delta_n$, all the asymptotic results equally apply. We further assume that $\delta_n \in \Theta$ where $\Theta$ is
a compact subset of $\mathbb{R}$ and $0 \in \Theta$. Finally, define $\widehat{\delta}_n^{\mathrm{mle}}$ as the MLE for $\delta_n$ for the model in (2.1).
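To make the re-parametrization concrete, here is a minimal sketch that maps $(\delta, \pi)$ to the component means and draws samples from the model in (2.1). The function names are hypothetical, and `sigma` is treated as the common component standard deviation, which is an assumption of this sketch rather than something fixed by the text.

```python
import numpy as np

def separation_constant(pi):
    """c = pi / (1 - pi), which forces the overall mixture mean to equal mu."""
    return pi / (1.0 - pi)

def component_means(delta, pi, mu=0.0):
    """Return (mu_0, mu_1) = (mu - delta, mu + c * delta)."""
    c = separation_constant(pi)
    return mu - delta, mu + c * delta

def sample_mixture(n, delta, pi, sigma=1.0, mu=0.0, rng=None):
    """Draw n i.i.d. observations; the first component has weight pi."""
    rng = np.random.default_rng() if rng is None else rng
    mu0, mu1 = component_means(delta, pi, mu)
    in_first = rng.random(n) < pi
    means = np.where(in_first, mu0, mu1)
    return means + sigma * rng.standard_normal(n)

# Illustrative values only: pi = 0.35 and delta = 0.3.
pi_example, delta_example = 0.35, 0.3
c_example = separation_constant(pi_example)
mu0, mu1 = component_means(delta_example, pi_example)
y = sample_mixture(200_000, delta_example, pi_example, rng=np.random.default_rng(0))
```

By construction, $\pi\mu_0 + (1-\pi)\mu_1 = \mu$ and $\mu_1 - \mu_0 = (1+c)\delta$, so the sample mean of `y` should be close to zero.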
The following result shows the convergence rates of the MLE for (2.1) when the variances are
assumed to be known:
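Theorem 2.1. For the model (2.1), the following holds for any $\epsilon > 0$:

(a) (Asymmetric regime) When $\pi \in (0,1/2)$, then

$$C_1(\epsilon)\left(\frac{1}{n}\right)^{1/6} \le \sup_{\delta_n \in \Theta_{1,n}(\epsilon)} E_{\delta_n}\left(\left|\widehat{\delta}_n^{\mathrm{mle}} - \delta_n\right|\right) \le C_2(\epsilon)\left(\frac{\log n}{n}\right)^{1/6},$$

where $\Theta_{1,n}(\epsilon) = \{\delta : |\delta| \le n^{-1/6+\epsilon}\}$.

(b) (Symmetric regime) When $\pi = 1/2$, then

$$C_1(\epsilon)\left(\frac{1}{n}\right)^{1/4} \le \sup_{\delta_n \in \Theta_{2,n}(\epsilon)} E_{\delta_n}\left(\left|\,|\widehat{\delta}_n^{\mathrm{mle}}| - |\delta_n|\,\right|\right) \le C_2(\epsilon)\left(\frac{\log n}{n}\right)^{1/4},$$

where $\Theta_{2,n}(\epsilon) = \{\delta : |\delta| \le n^{-1/4+\epsilon}\}$.

Here, $E_{\delta_n}$ denotes the expectation taken with respect to the product measure with mixture density
of $Y_1,\ldots,Y_n$ under the model (2.1). Furthermore, $C_1(\epsilon)$ and $C_2(\epsilon)$ are two positive constants
depending only on $\epsilon$. Symmetry gives an analogous result for $\pi \in (1/2,1)$.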


The proof of Theorem 2.1 is provided in Appendix G.1. The variance parameter, $\sigma$, is subsumed in
the constants and does not impact the rates.


Prior work (Chen, 1995) has shown that when $\delta_n = 0$ the rate is of order $n^{-1/4}$ for the
asymmetric case; the above therefore shows that there exists some $\delta_n \neq 0$ in a neighborhood of 0 where
convergence is even worse than this degenerate case. In particular, an immediate consequence of
this theorem is that, for $\pi \neq 1/2$, there exists a sequence of $\delta_n$ going to 0 at no more than an $n^{-1/6}$
rate such that the error of the MLE is also of order $n^{-1/6}$.
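To give a sense of how slow these rates are, the short sketch below computes how much the sample size must grow to halve an error of order $n^{-a}$. The function name is hypothetical and the calculation ignores constants and logarithmic factors, so it is only a back-of-the-envelope illustration.

```python
def sample_size_multiplier(rate_exponent, error_reduction=2.0):
    """Factor by which n must grow so that n**(-rate_exponent) shrinks by error_reduction."""
    return error_reduction ** (1.0 / rate_exponent)

# Rates that appear in the text: parametric (1/2), degenerate (1/4),
# weakly separated asymmetric (1/6), and unknown-variance symmetric (1/8).
multipliers = {a: sample_size_multiplier(a) for a in (1 / 2, 1 / 4, 1 / 6, 1 / 8)}
```

Under these simplifications, halving the error takes roughly 4 times the data at the parametric rate, but roughly 64 times at the $n^{-1/6}$ rate and 256 times at the $n^{-1/8}$ rate.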

For the symmetric regime we are simply looking at the difference in magnitude, not sign. This is
because when $\pi = 1/2$ the sign of $\delta_n$ is not identifiable, and we find that

$$\sup_{\delta_n \in \Theta} E_{\delta_n}\left|\widehat{\delta}_n^{\mathrm{mle}} - \delta_n\right| \gtrsim n^{-1/r},$$

for any $r > 2$ and for any fixed parameter space $\Theta$. Here, $E_{\delta_n}$ denotes the expectation taken
with respect to the product measure with mixture density of $Y_1,\ldots,Y_n$ under the model (2.1); see
Appendix G.3 for the proof.


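Connections to the Wasserstein metric. The above connects to the Wasserstein metric, which
has recently been used to study parameter estimation in mixture models (Nguyen, 2013; Ho and
Nguyen, 2016; Heinrich and Kahn, 2018), for additional interpretation of the results in Theorem
2.1. In particular, let $\widehat{G}_n^{\mathrm{mle}}$ denote a probability measure (or equivalently mixing measure) with two
atoms $(-\widehat{\delta}_n^{\mathrm{mle}}, c\,\widehat{\delta}_n^{\mathrm{mle}})$ whose weights are $(\pi, 1-\pi)$ and $G_n$ a probability measure with two atoms
$(-\delta_n, c\delta_n)$ whose weights are $(\pi, 1-\pi)$. Then we can verify that the results of Theorem 2.1 are
equivalent to

$$C_1(\epsilon)\, n^{-1/6} \lesssim \sup_{\delta_n \in \Theta_{1,n}(\epsilon)} E_{\delta_n}\left(W_1(\widehat{G}_n^{\mathrm{mle}}, G_n)\right) \lesssim \sup_{\delta_n \in \Theta_{1,n}(\epsilon)} E_{\delta_n}\left(\left|\widehat{\delta}_n^{\mathrm{mle}} - \delta_n\right|\right) \le C_2(\epsilon)\left(\frac{\log n}{n}\right)^{1/6}$$

under the asymmetric regime and

$$C_1(\epsilon)\, n^{-1/4} \lesssim \sup_{\delta_n \in \Theta_{2,n}(\epsilon)} E_{\delta_n}\left(W_1(\widehat{G}_n^{\mathrm{mle}}, G_n)\right) \lesssim \sup_{\delta_n \in \Theta_{2,n}(\epsilon)} E_{\delta_n}\left(\left|\,|\widehat{\delta}_n^{\mathrm{mle}}| - |\delta_n|\,\right|\right) \le C_2(\epsilon)\left(\frac{\log n}{n}\right)^{1/4}$$

under the symmetric regime.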


2.2 Unknown equal variances setting


We now show that our previous results still generally hold when we relax the restriction that the
variances are known. For the unknown equal variances setting, we assume that $Y_1,\ldots,Y_n$ are i.i.d.
samples from a two-component location-scale Gaussian mixture with density

$$Y_i \overset{iid}{\sim} \pi\,\mathcal{N}(\mu - \delta_n, \sigma_n^2) + (1-\pi)\,\mathcal{N}(\mu + c\delta_n, \sigma_n^2). \qquad (2.2)$$

Here, $\delta_n$ and $\sigma_n$ change with the sample size $n$ and converge to some limit points. We assume
$\sigma_n \in \Omega$, a compact subset of $\mathbb{R}_+$. We set the overall mean $\mu = 0$ for convenience as before; $\delta_n$ is
again a scaling of the gap between the two mixture means. We define $(\widehat{\delta}_n^{\mathrm{mle}}, \widehat{\sigma}_n^{\mathrm{mle}})$ as the MLE for
the separation and scale parameters for the model in (2.2). Unlike the previous convergence results
for $\widehat{\delta}_n^{\mathrm{mle}}$ in the case with known variance, the convergence rates of $\widehat{\delta}_n^{\mathrm{mle}}$ and $\widehat{\sigma}_n^{\mathrm{mle}}$ are much harder to
establish due to the strong dependence between the separation parameter $\delta$ and scale parameter $\sigma$,
which is determined by the following partial differential equation (PDE):

$$\frac{\partial^2 f}{\partial \delta^2}(x; \delta, \sigma) = 2\,\frac{\partial f}{\partial \sigma^2}(x; \delta, \sigma) \qquad (2.3)$$

for all $x, \delta, \sigma$ and Normal density $f$. This dependence leads to worse convergence rates for parameter
estimation for over-fit location-scale Gaussian mixtures (Ho and Nguyen, 2016) and for hypothesis
testing for the number of components of location-scale Gaussian mixtures (Chen and Chen, 2003).
Under the specific setting that we consider, this dependence leads to a new characterization of the
asymptotic behavior of $\widehat{\delta}_n^{\mathrm{mle}}$, $|\widehat{\delta}_n^{\mathrm{mle}}|$, and $\widehat{\sigma}_n^{\mathrm{mle}}$ under the two regimes $\pi \in (0,1/2)$ and $\pi = 1/2$. To
the best of our knowledge, these have not been previously addressed in the literature.


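Theorem 2.2. Take $\pi \in (0,1/2]$. Under the unknown equal variances setting (2.2), the following
holds:

(a) (Asymmetric regime) When $\pi \in (0,1/2)$, then

$$C_1(\epsilon)\left(\frac{1}{n}\right)^{1/3} \le \sup_{(\delta_n,\sigma_n) \in S_{1,n}(\epsilon)} E_{(\delta_n,\sigma_n)}\left(\left|\widehat{\delta}_n^{\mathrm{mle}} - \delta_n\right|^2 + \left|(\widehat{\sigma}_n^{\mathrm{mle}})^2 - \sigma_n^2\right|\right) \le C_2(\epsilon)\left(\frac{\log n}{n}\right)^{1/3},$$

where $S_{1,n}(\epsilon) = \{(\delta_n, \sigma_n) : |\delta_n|^2 + |(\sigma_n)^2 - (\sigma^0)^2| \le n^{-1/3+\epsilon}\}$ for any $\epsilon > 0$ and some positive
constant $\sigma^0$.

(b) (Symmetric regime) When $\pi = 1/2$, then

$$C_1(\epsilon)\left(\frac{1}{n}\right)^{1/4} \le \sup_{(\delta_n,\sigma_n) \in S_{2,n}(\epsilon)} E_{(\delta_n,\sigma_n)}\left(\left|\,|\widehat{\delta}_n^{\mathrm{mle}}| - |\delta_n|\,\right|^2 + \left|(\widehat{\sigma}_n^{\mathrm{mle}})^2 - \sigma_n^2\right|\right) \le C_2(\epsilon)\left(\frac{\log n}{n}\right)^{1/4},$$

where $S_{2,n}(\epsilon) = \{(\delta_n, \sigma_n) : |\delta_n|^2 + |(\sigma_n)^2 - (\sigma^0)^2| \le n^{-1/4+\epsilon}\}$ for any $\epsilon > 0$ and some positive
constant $\sigma^0$.

Here, $E_{(\delta_n,\sigma_n)}$ denotes the expectation taken with respect to a product measure with a mixture density
of $Y_1,\ldots,Y_n$ under the unknown equal variances setting (2.2). Furthermore, $C_1(\epsilon)$ and $C_2(\epsilon)$ are
two positive constants depending only on $\epsilon$.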


The proof of Theorem 2.2 is provided in Appendix G.2.


A few comments are in order. First, under the asymmetric regime, the convergence rate of the
separation parameter $\widehat{\delta}_n^{\mathrm{mle}}$ to $\delta_n$ is of an order no more than $n^{-1/6}$ (due to the squared term within
the expectation) while that of the scale parameter $(\widehat{\sigma}_n^{\mathrm{mle}})^2$ to $(\sigma_n)^2$ is no more than order $n^{-1/3}$, as long
as the true parameters $\delta_n$ and $\sigma_n$ belong to $S_{1,n}(\epsilon)$. The PDE of the distribution in (2.3) suggests
the faster apparent convergence rate of the scale parameter relative to the separation parameter.
Second, under the symmetric regime, the worst-case convergence rate of $|\widehat{\delta}_n^{\mathrm{mle}}|$ to $|\delta_n|$ is $n^{-1/8}$,
which is slower than the worst-case rate $n^{-1/4}$ of $(\widehat{\sigma}_n^{\mathrm{mle}})^2$ to $(\sigma_n)^2$, when the true parameters $\delta_n$
and $\sigma_n$ belong to $S_{2,n}(\epsilon)$. Here, we consider the absolute value of the separation parameter for the
convergence as the sign of the separation parameter is not identifiable under the symmetric setting.
Furthermore, in contrast to the known variance setting (2.1), the worst-case convergence rate of the
separation parameter under the symmetric regime is slower than that of the separation parameter
under the asymmetric regime. That fundamental difference can again be explained by the PDE of
the location-scale Gaussian distribution.


3 Non-asymptotic properties of the MLE: Pile Up 


Thus far, we have established rigorous asymptotic (minimax) behaviors of the MLE under the
asymmetric and symmetric cases of model (2.1) and model (2.2). The goal of this section is to shed some
light on the non-asymptotic properties of the MLE. To facilitate the discussion, we focus
solely on the known variances setting (2.1), i.e., we want to analyze the non-asymptotic behavior
of the MLE when $\delta_n$ is near zero. We work with the likelihood function of our re-parameterized model
(again, setting $\mu = 0$). This allows us to directly obtain statements regarding the points of
maximum likelihood, which in turn allows for the characterization of the MLE's behavior. In
particular, we first show that under our parameterization, zero (corresponding to no separation) will
always be an inflection point if not a local mode. Finally, we show that, in general, the local mode
is in fact the MLE when the estimated overall variance is less than $\sigma$, the assumed component
variance.


3.1 Zero as a local mode of the likelihood 


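Given an observation $Y = y$ from the mixture model (2.1), the log-likelihood for $\delta_n$ is

$$\ell(\delta_n \mid Y = y) = \log\left(\pi e^{-0.5(y+\delta_n)^2} + (1-\pi)\, e^{-0.5(y-c\delta_n)^2}\right), \qquad (3.1)$$

where we set $\sigma = 1$, though these results immediately extend to arbitrary $\sigma$. The score function is
then

$$\ell'(\delta_n \mid Y = y) = \frac{-\pi (y+\delta_n)\, e^{-0.5(y+\delta_n)^2} + c(1-\pi)(y - c\delta_n)\, e^{-0.5(y-c\delta_n)^2}}{\pi e^{-0.5(y+\delta_n)^2} + (1-\pi)\, e^{-0.5(y-c\delta_n)^2}}. \qquad (3.2)$$

Since $c = \frac{\pi}{1-\pi}$ with $\pi \in (0,1/2]$, it follows from (3.2) that

$$\ell'(0 \mid Y = y) = 0, \quad \text{for all } y \in \mathbb{R}. \qquad (3.3)$$

Given the samples $\mathbf{Y}_n = (Y_1, Y_2, \ldots, Y_n)$ from model (2.1), Equation (3.3) yields the following
approximation of the log-likelihood given samples $\mathbf{Y}_n$:

$$\ell(\delta_n \mid \mathbf{Y}_n) = \ell(0 \mid \mathbf{Y}_n) + \frac{1}{2}\,\ell''(0 \mid \mathbf{Y}_n)\,\delta_n^2 + O(\delta_n^3). \qquad (3.4)$$

In the event that $\ell''(0 \mid \mathbf{Y}_n) < 0$, zero is a local mode for the log-likelihood function $\ell(\delta_n \mid \mathbf{Y}_n)$;
we call this event

$$\mathcal{E} = \{\ell''(0 \mid \mathbf{Y}_n) < 0\}. \qquad (3.5)$$

Direct calculation yields that

$$\ell''(0 \mid \mathbf{Y}_n) = c\left(\sum_{i=1}^n Y_i^2 - n\right), \qquad (3.6)$$

and thus $\ell''(0 \mid \mathbf{Y}_n) < 0$ when $\sum_{i=1}^n Y_i^2 < n$. Equivalently, $\ell''(0 \mid \mathbf{Y}_n) < 0$ when $\widehat{m}_2 < 1$, where
$\widehat{m}_2 = \frac{1}{n}\sum_{i=1}^n Y_i^2$ is the observed second moment of the mixture distribution, and the assumed
within-component variance is 1. We return to this connection to higher-order moments below.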
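The closed form in (3.6) can be checked numerically. The sketch below evaluates the log-likelihood of model (2.1) with $\mu = 0$ and $\sigma = 1$ (dropping the additive constant, as in (3.1)) and compares a central finite-difference second derivative at zero against $c(\sum_i Y_i^2 - n)$. The function names, the step size `h`, and the simulated data are assumptions of this illustration.

```python
import numpy as np

def log_likelihood(delta, y, pi):
    """Sum over observations of log(pi*exp(-0.5(y+delta)^2) + (1-pi)*exp(-0.5(y-c*delta)^2))."""
    c = pi / (1.0 - pi)
    first = np.log(pi) - 0.5 * (y + delta) ** 2
    second = np.log1p(-pi) - 0.5 * (y - c * delta) ** 2
    return float(np.sum(np.logaddexp(first, second)))

def second_derivative_at_zero(y, pi):
    """Closed form from (3.6): c * (sum of squares - n)."""
    c = pi / (1.0 - pi)
    return float(c * (np.sum(y ** 2) - y.size))

def numerical_second_derivative(y, pi, h=1e-3):
    """Central finite-difference approximation of the second derivative at delta = 0."""
    return (log_likelihood(h, y, pi) - 2.0 * log_likelihood(0.0, y, pi)
            + log_likelihood(-h, y, pi)) / h ** 2

# Illustrative data: any sample works, since (3.6) is an algebraic identity.
y = np.random.default_rng(0).standard_normal(500)
pi_example = 0.3
closed_form = second_derivative_at_zero(y, pi_example)
finite_diff = numerical_second_derivative(y, pi_example)
slope_at_zero = (log_likelihood(1e-3, y, pi_example)
                 - log_likelihood(-1e-3, y, pi_example)) / 2e-3
```

The same check confirms (3.3): the finite-difference slope at zero is numerically negligible regardless of the data.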


3.2 Zero as the global mode of the likelihood 


After establishing that zero is a local mode of the likelihood when $\ell''(0 \mid \mathbf{Y}_n) < 0$, an important
question is whether zero is also a global mode in this case. Let $\mathcal{F} = \{\widehat{\delta}_n^{\mathrm{mle}} = 0\}$ be the event that
zero is also the global mode for the likelihood function $\ell(\delta \mid \mathbf{Y}_n)$, where $\widehat{\delta}_n^{\mathrm{mle}}$ is the MLE under the
setting of model (2.1). We refer to the event $\mathcal{F}$ as pile up throughout the paper. While it is clear
that $\mathcal{F} \subset \mathcal{E}$, the reverse implication is not trivial. We divide our analysis into two cases: $\pi = 1/2$
and $\pi \in (0,1/2)$. We again denote $\widehat{m}_2 := \frac{1}{n}\sum_{i=1}^n Y_i^2$.


Symmetric case. When $\pi = 1/2$, conditioning on the event $\mathcal{E}$ (equivalently $\widehat{m}_2 < 1$), we can check
that

$$\frac{1}{n}\,\ell''(\delta \mid \mathbf{Y}_n) = \frac{1}{n}\sum_{i=1}^n \frac{4 Y_i^2}{\left(\exp(-\delta Y_i) + \exp(\delta Y_i)\right)^2} - 1 \le \widehat{m}_2 - 1 < 0,$$

where the inequality is due to applying Cauchy-Schwarz, $\exp(-\delta Y_i) + \exp(\delta Y_i) \ge 2$ for all
$i \in \{1,\ldots,n\}$. The above inequality implies that the log-likelihood function $\ell(\delta \mid \mathbf{Y}_n)$ is strictly concave
under the event $\mathcal{E}$. Therefore, zero is the global maximum of the log-likelihood function under the
event $\mathcal{E}$. This leads to the following result regarding pile up.




Proposition 1. Under the symmetric setting of location-scale Gaussian mixtures with known
variances, $\mathcal{E} = \mathcal{F}$, i.e., pile up occurs as long as 0 is a local maximum of the log-likelihood function.


The result of Proposition 1 suggests that we can rewrite the representation of the MLE under the
symmetric setting with known variances as

$$\widehat{\delta}_n^{\mathrm{mle}} = \begin{cases} 0, & \text{if } \widehat{m}_2 < 1 \\ O_p(n^{-1/4}), & \text{if } \widehat{m}_2 > 1 \end{cases}$$

Thus, at least in the symmetric case, the MLE behaves like a threshold estimator analogous to the
classic Hodges estimator (see Van der Vaart, 2000).
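The threshold behavior in the symmetric case can be seen directly with a grid search. The sketch below assumes $\pi = 1/2$, $\mu = 0$, and $\sigma = 1$, in which case the log-likelihood reduces (up to a constant) to $\sum_i [-\delta^2/2 + \log\cosh(\delta Y_i)]$; it then maximizes over $|\delta|$ on a grid. The grid, the function names, and the two toy data sets are assumptions chosen only to land on either side of $\widehat{m}_2 = 1$.

```python
import numpy as np

def symmetric_log_likelihood(delta, y):
    """Log-likelihood for pi = 1/2, sigma = 1, up to an additive constant."""
    z = delta * y
    log_cosh = np.logaddexp(z, -z) - np.log(2.0)
    return float(np.sum(-0.5 * delta ** 2 + log_cosh))

def grid_mle_abs_delta(y, grid=None):
    """Grid-search MLE of |delta|; only the magnitude is identifiable when pi = 1/2."""
    grid = np.linspace(0.0, 3.0, 3001) if grid is None else grid
    values = np.array([symmetric_log_likelihood(d, y) for d in grid])
    return float(grid[np.argmax(values)])

# Toy data with observed second moment below 1: the MLE piles up at zero.
y_small = np.array([0.5, -0.4, 0.9, -1.1, 0.2, 0.7])
# Toy data with observed second moment above 1: the MLE moves away from zero.
y_large = np.array([2.0, -2.1, 1.8, -1.9, 2.2, -2.0])

mle_small = grid_mle_abs_delta(y_small)
mle_large = grid_mle_abs_delta(y_large)
```

A grid search is used here for transparency rather than efficiency; any one-dimensional optimizer would serve, since the log-likelihood is strictly concave whenever $\widehat{m}_2 < 1$.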


Asymmetric case. Unlike the symmetric case, we can see via simulations that there are instances
for which $\mathcal{E} \neq \mathcal{F}$ in relatively small samples. Nonetheless, these counter-examples are fairly rare;
for $\Delta = (1+c)\delta_n = 0.25$, $\{\mathcal{E} \cap \mathcal{F}^c\}$ occurs in fewer than 3 percent of simulation draws with sample
sizes less than $N = 500$, decreasing to below 1 percent with sample sizes of $N = 1000$ or more.
Extensive simulation studies seem to imply that $P_n(\mathcal{F}) \approx P_n(\mathcal{E})$.¹ We do not have a rigorous proof
of this and therefore state it as a conjecture:


Conjecture 3.1. Under the asymmetric setting of location-scale Gaussian mixtures with known
variances, if $\delta_n = o(1)$, then $\lim_{n \to \infty} P_n(\mathcal{E} \cap \mathcal{F}^c) = 0$.


Thus Conjecture 3.1, if true, implies that, for the asymmetric setting of location-scale Gaussian
mixtures with known variances, the probability that pile up occurs, i.e., $\widehat{\delta}_n^{\mathrm{mle}} = 0$, can be well
approximated by the probability of the event $\{\ell''(0 \mid \mathbf{Y}_n) < 0\}$. In other words, we can safely ignore
the case in which zero is a local but not a global mode of the likelihood.

Figure 2 shows this pile up phenomenon in practice. Specifically, Figures 2a and 2b show the
likelihood surfaces for two data sets generated via Equation (1.1), with $N = 200$, $\pi = 0.35$, and
$\Delta = (1+c)\delta = 0.6$. In Figure 2a, the likelihood is bimodal and the global mode is close to the
truth, albeit more extreme.² In Figure 2b, the likelihood is unimodal and centered at zero, which
is far from the truth.
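The relationship between the local-mode event $\mathcal{E}$ and the pile up event $\mathcal{F}$ in the asymmetric case can be explored with a small simulation. The sketch below is an illustration under stated assumptions, not the simulation design used for the figures: it fixes $\mu = 0$ and $\sigma = 1$, borrows the Figure 2 settings ($N = 200$, $\pi = 0.35$, $\Delta = 0.6$), and approximates the MLE by a grid search over $\delta \in [-3, 3]$, so the event $\mathcal{F}$ is only detected up to the grid resolution. All function names are hypothetical.

```python
import numpy as np

def log_likelihood_on_grid(y, pi, grid):
    """Log-likelihood of model (2.1) with mu = 0, sigma = 1, at each grid value of delta."""
    c = pi / (1.0 - pi)
    first = np.log(pi) - 0.5 * (y[None, :] + grid[:, None]) ** 2
    second = np.log1p(-pi) - 0.5 * (y[None, :] - c * grid[:, None]) ** 2
    return np.logaddexp(first, second).sum(axis=1)

def simulate_pile_up(n_sims=200, n=200, pi=0.35, separation=0.6, seed=1):
    """Estimate P(E), P(F), and P(E and not F) by simulation."""
    rng = np.random.default_rng(seed)
    c = pi / (1.0 - pi)
    delta = separation / (1.0 + c)
    # Symmetric grid that contains zero exactly.
    half = np.linspace(0.01, 3.0, 300)
    grid = np.concatenate([-half[::-1], [0.0], half])
    local_mode = np.zeros(n_sims, dtype=bool)   # event E: second moment below 1
    global_mode = np.zeros(n_sims, dtype=bool)  # event F: grid maximizer equals 0
    for s in range(n_sims):
        in_first = rng.random(n) < pi
        y = np.where(in_first, -delta, c * delta) + rng.standard_normal(n)
        local_mode[s] = np.mean(y ** 2) < 1.0
        values = log_likelihood_on_grid(y, pi, grid)
        global_mode[s] = grid[np.argmax(values)] == 0.0
    return {
        "P(E)": float(local_mode.mean()),
        "P(F)": float(global_mode.mean()),
        "P(E and not F)": float((local_mode & ~global_mode).mean()),
    }

results = simulate_pile_up()
```

Comparing `results["P(E)"]` with `results["P(F)"]` gives a rough empirical sense of how often zero is a local but not a global mode under these settings.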


4 Diagnostics for MLE pathologies 


The results above suggest that the higher-order moments of the mixture distribution play an
important role in the finite sample properties of the MLE. We now construct diagnostics for the MLE
using these moments. First, we use these higher-order moments to construct diagnostics for pile
up for the MLE, specifically the probability that pile up will occur given a set of moments, either
observed moments or assumed moments. We then construct similar diagnostics for the relative
order of the components, as captured by the sign of $\Delta$. Throughout, we consider the setting with
known variances, since the corresponding moment equations are tractable in this case.

Figure 2: Two example likelihoods for component means, with data generated via Equation (1.1)
with parameters $N = 200$, $\pi = 0.35$, and $\Delta = 0.6$. The ‘+’ denotes the true component means.

¹ The index $n$ denotes the fact that the sampling distribution in (2.1) changes with $n$.

² The characterization of $\widehat{\delta}_n^{\mathrm{mle}}$ as a Hodges-like estimator suggests that the MLE will be biased away from zero when
$\widehat{\delta}_n^{\mathrm{mle}} \neq 0$. This is closely related to the bias induced by introducing identifiability constraints, such as $\delta > 0$ (Jasra
et al., 2005; Frühwirth-Schnatter, 2006). In both cases, the MLE is the maximum of a truncated likelihood surface,
truncated at the line $\delta = 0$.


4.1 Probability of pile up 


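The probability of pile up can be characterized by using the sampling distribution of the second
moment, $Y^2$. In particular, we can determine $P\{\widehat{m}_2 < 1\}$ using the first three moments of $Y^2$:

$$m_2 = E[Y^2] = 1 + c\delta_n^2 \qquad (4.1)$$

$$v_2 = V[Y^2] = 3 + 6c\delta_n^2 + \left(\pi + c^4(1-\pi)\right)\delta_n^4 - m_2^2 \qquad (4.2)$$

$$\Gamma_3 = \frac{1}{v_2^{3/2}}\, E\left[\left|Y^2 - m_2\right|^3\right], \qquad (4.3)$$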
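where we can obtain $\Gamma_3$ via Monte Carlo methods. Using the Berry-Esseen theorem for the
convergence rates of a CLT, and assuming Conjecture 3.1, we can obtain the following bound for the
probability of pile up:

$$\left|P_n(\mathcal{F}) - \Phi(b_n)\right| \le \frac{C\,\Gamma_3}{\sqrt{n}}, \qquad b_n = \frac{1 - m_2}{\sqrt{v_2/n}}, \qquad (4.4)$$

where $C$ is the absolute constant in the Berry-Esseen theorem. As we show in simulations, $\Phi(b_n)$
appears to be an excellent approximation to the empirical pile up probability, even though the bound,
which depends on the sixth mixture moment, can be wide in practice. See supplementary materials.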
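The Normal approximation $\Phi(b_n)$ is straightforward to compute from assumed population parameters. The sketch below assumes $\mu = 0$ and $\sigma = 1$, takes $b_n = (1 - m_2)/\sqrt{v_2/n}$, and obtains $E[Y^4]$ from the Normal moment identity $E[(a+Z)^4] = a^4 + 6a^2 + 3$ applied to each component. The function names are hypothetical, and the Berry-Esseen error term is not computed.

```python
import math

def mixture_moments(delta, pi):
    """Return (m2, v2) = (E[Y^2], Var[Y^2]) for model (2.1) with mu = 0, sigma = 1."""
    c = pi / (1.0 - pi)
    m2 = 1.0 + c * delta ** 2
    # E[Y^4], using E[(a + Z)^4] = a^4 + 6 a^2 + 3 for each component mean a.
    fourth = 3.0 + 6.0 * c * delta ** 2 + (pi + c ** 4 * (1.0 - pi)) * delta ** 4
    return m2, fourth - m2 ** 2

def normal_cdf(z):
    """Standard Normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def pile_up_probability(separation, pi, n):
    """Normal approximation Phi(b_n) to the probability that the second moment is below 1."""
    c = pi / (1.0 - pi)
    delta = separation / (1.0 + c)  # separation = (1 + c) * delta
    m2, v2 = mixture_moments(delta, pi)
    b_n = (1.0 - m2) / math.sqrt(v2 / n)
    return normal_cdf(b_n)

# Settings quoted for Figure 3a in the text: pi = 0.325, separation 0.25.
p_small_n = pile_up_probability(0.25, 0.325, 100)
p_large_n = pile_up_probability(0.25, 0.325, 5000)
```

Under these assumptions the approximation gives a pile up probability of about one in four at $n = 5{,}000$ for a separation of 0.25, and exactly one half when the separation is zero.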

We can use this result for practical diagnostics, both for planning a future analysis and for
assessing a particular data set. Figure 3a shows the pile up probability computed via simulation
and via Equation (4.4), with $\pi = 0.325$, $\Delta = (1 + c)\delta_n = 0.25$, and varying $n$. First,
there is excellent agreement between the simulations and the Normal approximation, though
$\Phi(b_n)$ slightly understates the probabilities obtained via simulation. Second, while the
probability of pile up is decreasing in both $n$ and $\Delta$, it is hardly a "small sample" issue.
For $\Delta = 0.25$, which would be quite large in many social science applications, pile up remains
a meaningful possibility even with sample sizes in the thousands. For $\Delta = 1.0$, which would be
an implausibly large separation in many settings, the probability of pile up is still greater than
1 in 4 for $n = 5,000$. Finally, Figure 3b shows similar results for a moderate sample size of
$N = 200$ but varying mixing proportions. In this case, the probability of pile up decreases as
$\pi$ approaches 0.5. We believe that figures such as these are useful diagnostics before observing
the mixing distribution itself.
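The comparison behind a figure like Figure 3a can be sketched in a few lines. This is an illustrative sketch rather than the authors' code. It assumes a centered mixture $\pi N((1 - \pi)\Delta, 1) + (1 - \pi) N(-\pi\Delta, 1)$, uses the raw second sample moment about the known center, takes $b_n = \sqrt{n}(1 - m_2)/\sqrt{v_2}$ (the form implied by the plug-in expression later in this section), and takes $v_2$ from the first diagonal entry of Equation (4.5). The function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def pile_up_normal_approx(n, pi, delta):
    """Phi(b_n), assuming b_n = sqrt(n) * (1 - m2) / sqrt(v2)."""
    p = pi * (1.0 - pi)
    m2 = 1.0 + p * delta**2                             # population second moment
    v2 = 2.0 * m2**2 + p * (1.0 - 6.0 * p) * delta**4   # Var(Y^2), as in Eq. (4.5)
    b_n = math.sqrt(n) * (1.0 - m2) / math.sqrt(v2)
    return NormalDist().cdf(b_n)


def pile_up_simulated(n, pi, delta, reps=400, seed=0):
    """Share of simulated samples whose raw second moment falls below 1."""
    rng = random.Random(seed)
    mu_a, mu_b = (1.0 - pi) * delta, -pi * delta        # centered component means
    hits = 0
    for _ in range(reps):
        total = 0.0
        for _ in range(n):
            mu = mu_a if rng.random() < pi else mu_b
            total += rng.gauss(mu, 1.0) ** 2
        if total / n < 1.0:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    for n in (200, 1000, 5000):
        print(n, round(pile_up_normal_approx(n, 0.325, 0.25), 3))
```

Because the simulated quantity is the event $\hat{m}_2 < 1$ rather than the MLE itself, the sketch leans on the same conjecture the text invokes to link that event to pile up.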


Figure 3: Probability of pile up given sample size and separation of means. Dotted lines are
simulated values across 5,000 simulations; solid lines use the Normal approximation, $\Phi(b_n)$.

We can also incorporate information from the observed mixture distribution. First, we can
plug in the observed empirical moments, $\hat{m}_2$ and $\hat{v}_2$, to calculate
$\hat{b} = (1 - \hat{m}_2)/\sqrt{\hat{v}_2/n}$ and $\Phi(\hat{b})$. This relies on the Normal
approximation for the sampling distribution as well as on precisely estimating $v_2$, which depends
on the fourth moment of the observed mixture distribution and might be noisy in practice.
Alternatively, we could use a case-resampling bootstrap to estimate $P\{\hat{m}_2 < 1\}$. Note that
this is not the same as using the case-resampling bootstrap to estimate standard errors, which we
advise against (see supplementary materials). Rather, this is analogous to the use of the bootstrap
as a diagnostic tool in finite mixtures; see, for example, Grün and Leisch (2004). Finally, we note
that an estimated MLE of zero still provides some information about the unknown parameter. For
instance, if $\hat{\Delta}^{mle} = (c + 1)\hat{\delta}_n = 0$, $\Delta = 0.2$ is a much more plausible
value than $\Delta = 2.0$. We discuss this in the supplementary materials.
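Both data-based diagnostics can be sketched together. The code below is a hedged illustration, not the authors' implementation. It assumes the outcome has already been standardized so that the within-component variance is 1, centers at the sample mean, and uses $\hat{b} = (1 - \hat{m}_2)/\sqrt{\hat{v}_2/n}$ with $\hat{v}_2$ the sample variance of the squared centered outcomes. The function name and defaults are hypothetical.

```python
import math
import random
from statistics import NormalDist


def second_moment_diagnostics(y, n_boot=2000, seed=0):
    """Return (b_hat, Phi(b_hat), bootstrap estimate of P{m2_hat < 1})."""
    n = len(y)
    ybar = sum(y) / n
    squares = [(v - ybar) ** 2 for v in y]
    m2_hat = sum(squares) / n
    v2_hat = sum((s - m2_hat) ** 2 for s in squares) / n   # sample variance of Y^2
    b_hat = (1.0 - m2_hat) / math.sqrt(v2_hat / n)
    phi_b = NormalDist().cdf(b_hat)

    # Case-resampling bootstrap: how often does the resampled second moment fall below 1?
    rng = random.Random(seed)
    below = 0
    for _ in range(n_boot):
        resample = [y[rng.randrange(n)] for _ in range(n)]
        rbar = sum(resample) / n
        m2_star = sum((v - rbar) ** 2 for v in resample) / n
        if m2_star < 1.0:
            below += 1
    return b_hat, phi_b, below / n_boot
```

The two estimates should be broadly similar when the Normal approximation is adequate; a large gap between them is itself a sign that $\hat{v}_2$ is noisy.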


4.2 Probability of a sign error 


We now turn to the sign of $\hat{\Delta}^{mle}$ when $\pi \neq 1/2$ (the sign is not estimable when
$\pi = 1/2$). Specifically, we define a sign error as
$\mathrm{sgn}(\hat{\Delta}^{mle}) \neq \mathrm{sgn}(\Delta)$. This is a well-studied issue in mixture
modeling; for example, choosing the true mode in a multimodal likelihood is a classic problem (see
Gan and Jiang, 1999; Biernacki, 2005). Redner and Walker (1984) give a foundational review of
asymptotic versus local identifiability in mixtures. For a more recent perspective, see Kim and
Lindsay (2015), who introduce the concept of empirical identifiability.

As with pile up, we use higher order moments for diagnosis. This is slightly more complicated
than for pile up because $\mathrm{sgn}(\hat{\Delta}^{mle})$ is undefined when
$\hat{\Delta}^{mle} = 0$. Thus, we need to consider the joint sampling distribution of both the
second and third moments. In the setting with known, equal variances in Equation (2.1), we have the
following moment equations:

$$m_2 = E[Y^2] = 1 + \pi(1 - \pi)\Delta^2$$

$$m_3 = E[Y^3] = \pi(1 - \pi)(1 - 2\pi)\Delta^3.$$

Following Tan and Chang (1972), the corresponding sample moments have the following distribution:

$$\begin{pmatrix} \hat{m}_2 \\ \hat{m}_3 \end{pmatrix} \sim N\left( \begin{pmatrix} m_2 \\ m_3 \end{pmatrix},\ \frac{1}{n} \begin{pmatrix} \kappa_{11}\Delta^4 + 2m_2^2 & \kappa_{12}\Delta^5 + 6m_2 m_3 \\ \kappa_{12}\Delta^5 + 6m_2 m_3 & \kappa_{22a}\Delta^6 + \kappa_{22b}m_2\Delta^4 + 6m_2^3 \end{pmatrix} \right), \qquad (4.5)$$

with constants $\kappa_{11} = \pi(1 - \pi)(1 - 6\pi(1 - \pi))$;
$\kappa_{12} = \pi(1 - \pi)(1 - 2\pi)(1 - 12\pi(1 - \pi))$;
$\kappa_{22a} = \pi(1 - \pi)(1 - 30\pi(1 - \pi) + 120\pi^2(1 - \pi)^2) + 9\pi^2(1 - \pi)^2(1 - 2\pi)^2$;
and $\kappa_{22b} = 9\pi(1 - \pi)(1 - 6\pi(1 - \pi))$.

Thus, we can approximate the joint probability of pile up, sign error, or neither for a given
$\Delta$, $n$, and $\pi$, where we set $\Delta > 0$ for illustration:

$$P(\{\text{pile up};\ \text{sign error};\ \text{neither}\}) \approx P(\{\hat{m}_2 < 1;\ \hat{m}_2 > 1 \cap \hat{m}_3 < 0;\ \hat{m}_2 > 1 \cap \hat{m}_3 > 0\}). \qquad (4.6)$$

If desired, we could apply a similar Berry-Esseen bound for these probabilities, as in Equation
(4.4). Instead, we simply invoke the Central Limit Theorem and use the Normal approximation in
Equation (4.5).
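The three probabilities in Equation (4.6) can be approximated numerically from the bivariate Normal in Equation (4.5). The sketch below is illustrative only. It evaluates the regions by drawing from the approximating Normal (a substitute for computing bivariate Normal rectangle probabilities directly), it takes the off-diagonal covariance term to be $\kappa_{12}\Delta^5 + 6 m_2 m_3$ (our reading of a partly illegible entry), and it defines a sign error as $\hat{m}_3$ having the opposite sign of $m_3$. All function names are hypothetical.

```python
import math
import random


def moment_approximation(n, pi, delta):
    """Mean and covariance of (m2_hat, m3_hat) as read from Equation (4.5)."""
    p = pi * (1.0 - pi)
    m2 = 1.0 + p * delta**2
    m3 = p * (1.0 - 2.0 * pi) * delta**3
    k11 = p * (1.0 - 6.0 * p)
    k12 = p * (1.0 - 2.0 * pi) * (1.0 - 12.0 * p)
    k22a = p * (1.0 - 30.0 * p + 120.0 * p**2) + 9.0 * p**2 * (1.0 - 2.0 * pi) ** 2
    k22b = 9.0 * p * (1.0 - 6.0 * p)
    v11 = (k11 * delta**4 + 2.0 * m2**2) / n
    v12 = (k12 * delta**5 + 6.0 * m2 * m3) / n      # assumed form of the cross term
    v22 = (k22a * delta**6 + k22b * m2 * delta**4 + 6.0 * m2**3) / n
    return m2, m3, v11, v12, v22


def region_probabilities(n, pi, delta, draws=20000, seed=0):
    """Approximate P(pile up), P(sign error), P(neither) under Equation (4.5)."""
    m2, m3, v11, v12, v22 = moment_approximation(n, pi, delta)
    if m3 == 0.0:
        raise ValueError("sign is not defined when pi = 1/2 or delta = 0")
    a = math.sqrt(v11)                 # Cholesky factor of the 2x2 covariance
    b = v12 / a
    c = math.sqrt(v22 - b * b)
    rng = random.Random(seed)
    pile_up = sign_error = 0
    for _ in range(draws):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        m2_hat = m2 + a * z1
        m3_hat = m3 + b * z1 + c * z2
        if m2_hat < 1.0:
            pile_up += 1
        elif m3_hat * m3 < 0.0:
            sign_error += 1
    neither = draws - pile_up - sign_error
    return {"pile_up": pile_up / draws,
            "sign_error": sign_error / draws,
            "neither": neither / draws}
```

Dividing `sign_error` by `1 - pile_up` gives the conditional probability of a sign error given no pile up, which is the quantity discussed next.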

Figure 4 shows the conditional probability of sign error given no pile up across values of $N$
and $\Delta$ found by two methods: (1) direct simulation (simulations are restricted to draws in
which $\hat{\Delta}^{mle} \neq 0$); and (2) the tail probability of Equation (4.6) based on the
Normal approximation in Equation (4.5). While the probability of a sign error decreases in both $n$
and $\Delta$, it remains remarkably high over plausible parameter values. Indeed, for
$\Delta = 0.25$ the sign of $\hat{\Delta}^{mle}$ is essentially a coin flip, even with a sample size
of 5,000. Importantly, conventional approaches for standard errors in the MLE (McLachlan and Peel,
2004) typically ignore this uncertainty. For additional discussion, see Kim and Lindsay (2015).

As in Section 4.1, we can assess the probability of sign error in practice. Based only on the
sample size and mixing proportion, we can re-create Figure 4 across plausible parameter values.
We can also plug observed values into Equation (4.5). Alternatively, we can count the proportion of
bootstrap replicates in which the sign of the bootstrapped third moment differs from the observed
sign and $\hat{m}_2 > 1$.
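The bootstrap count described above can be sketched as follows. This is a minimal illustration under stated assumptions: the outcome is standardized so that the within-component variance is 1, moments are centered at the (re)sample mean, and a replicate counts as a sign flip only when its second moment exceeds 1 and its third moment has the opposite sign of the observed one. The function names are hypothetical.

```python
import random


def central_moments(y):
    """Second and third sample moments about the sample mean."""
    n = len(y)
    ybar = sum(y) / n
    m2 = sum((v - ybar) ** 2 for v in y) / n
    m3 = sum((v - ybar) ** 3 for v in y) / n
    return m2, m3


def bootstrap_sign_error(y, n_boot=2000, seed=0):
    """Share of replicates with m2* > 1 and sgn(m3*) different from sgn(m3_obs)."""
    rng = random.Random(seed)
    n = len(y)
    _, m3_obs = central_moments(y)
    flips = 0
    for _ in range(n_boot):
        resample = [y[rng.randrange(n)] for _ in range(n)]
        m2_star, m3_star = central_moments(resample)
        if m2_star > 1.0 and m3_star * m3_obs < 0.0:
            flips += 1
    return flips / n_boot
```

Replicates whose third moment is exactly zero are not counted as flips; that convention is a choice made here for simplicity.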
Figure 4: Probability that $\mathrm{sgn}(\hat{\Delta}^{mle}) \neq \mathrm{sgn}(\Delta)$ based on
simulations (solid line) and the method of moments approximation in Equation (4.5) (dotted line);
based on $\pi = 0.25$ and 1000 simulations at each set of parameter values.


5 Application to principal stratification 


We now motivate the use of finite mixtures in principal stratification. For our primary running
example, we re-analyze the Job Search Intervention Study (JOBS II), a randomized field experiment
of a mental health and job training intervention among unemployed workers (Vinokur et al., 1995)
that has been extensively studied in the causal inference literature (Jo and Stuart, 2009; Mattei
et al., 2013). This is an example of one-sided noncompliance and is a simple but non-trivial
example of the principal stratification setup. In the supplementary materials, we also re-analyze a
randomized evaluation of Job Corps, the largest job training program in the US (Schochet et al.,
2008). We briefly discuss these results at the end of this section.


5.1 Setup 


We begin with the canonical example of a randomized experiment with noncompliance, such as
JOBS II, and set up the problem using the potential outcomes framework (Neyman, 1923; Rubin,
1974). We observe $N$ individuals who are randomly assigned to a treatment group, $T_i = 1$, or
control group, $T_i = 0$, with observed outcome, $Y$. For JOBS II, the primary outcome is a measure
of depression six months after randomization. As usual, we assume that randomization is valid
and that the Stable Unit Treatment Value Assumption holds (SUTVA; Rubin, 1980; Imbens and
Rubin, 2015). This allows us to define potential outcomes for individual $i$, $Y_i(0)$ and $Y_i(1)$,
under control and treatment respectively, with observed outcome,
$Y_i^{obs} = T_i Y_i(1) + (1 - T_i) Y_i(0)$. The fundamental problem of causal inference is that we
observe only one potential outcome for each unit. Finally, we define the Intent-to-Treat (ITT)
effect as the impact of randomization on the outcome, $ITT = E[Y_i(1) - Y_i(0)]$. Throughout, we take
expectations and probabilities to be over a hypothetical super-population.

Table 1: Summary statistics for observed groups in JOBS II

Z   D^obs   Observed Mean   Observed SD   Possible Principal Strata
1   1       -0.16           1.03          Compliers
1   0        0.05           0.96          Never Takers
0   0        0.14           0.99          Compliers and Never Takers

The main complication is that only 55% of those individuals assigned to treatment actually
enrolled in the program. Let $D_i$ be an indicator for whether individual $i$ receives the treatment,
with corresponding compliance $D_i(0)$ and $D_i(1)$ for control and treatment respectively. For
simplicity, we assume that only individuals assigned to treatment can receive the active
intervention (i.e., there is one-sided noncompliance), which is the case in the JOBS II evaluation.
Formally, $D_i(0) = 0$ for all $i$. This gives two subgroups of interest: Never Takers,
$D_i(1) = 0$, and Compliers, $D_i(1) = 1$. Following Angrist et al. (1996) and Frangakis and Rubin
(2002), we refer to these subgroups interchangeably as compliance types or principal strata,
$U_i \in \{c, n\}$, with "c" denoting Compliers and "n" denoting Never Takers. Table 1 shows the
relationship between observed groups and principal strata.


The two main estimands are the ITT effects for Compliers and Never Takers:

$$ITT_c = E[Y_i(1) - Y_i(0) \mid U_i = c] = \mu_{c1} - \mu_{c0},$$

$$ITT_n = E[Y_i(1) - Y_i(0) \mid U_i = n] = \mu_{n1} - \mu_{n0},$$

in which $\mu_{ut}$ represents the outcome mean for $U_i = u$ and $T_i = t$. We are primarily
interested in $ITT_c$, the impact of randomization on Compliers, which measures the impact of
actually enrolling in JOBS II. Since we observe stratum membership for individuals assigned to
treatment, we can immediately estimate $\mu_{c1}$ and $\mu_{n1}$. Moreover, due to randomization,
the observed proportion of Compliers in the treatment group is, in expectation, equal to the overall
proportion of Compliers in the population, $\pi = P\{U_i = c\}$. Thus, we treat $\pi$ as essentially
known or, at least, directly estimable. The main inferential challenge is that we do not observe
stratum membership in the control group. Rather we observe a mixture of Compliers and Never Takers
assigned to control:

$$Y_i^{obs} \mid T_i = 0 \sim \pi f_{c0}(y_i) + (1 - \pi) f_{n0}(y_i), \qquad (5.1)$$

where $f_{u0}(y)$ is the distribution of potential outcomes for individuals in stratum $u$ assigned
to control.

The standard solution for this problem is to invoke the exclusion restriction for Never Takers,
which states that $ITT_n = 0$, or equivalently, $\mu_{n1} = \mu_{n0}$. Substantively, this states
that the only impact of randomization on the outcome is by changing the intermediate variable, $D$.
This is often a reasonable assumption, since actual program participation, rather than the
randomization itself, is typically the important factor in practice. With this assumption, we can
then estimate $ITT_c$ with the usual instrumental variables approach (Angrist et al., 1996). In
JOBS II, however, there is a concern that randomization has a negative impact on depression levels
for Never Takers (see Mattei et al., 2013). Thus, assuming that $ITT_n = 0$ could lead to biased
estimates for $ITT_c$.
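For reference, the instrumental variables estimate under one-sided noncompliance reduces to a ratio: with $ITT_n = 0$, the overall effect satisfies $ITT = \pi \cdot ITT_c$, so $ITT_c = ITT/\pi$. The sketch below illustrates that ratio on plain Python lists; the function name and the toy data in the usage note are hypothetical and are not JOBS II values.

```python
def wald_itt_complier(y, t, d):
    """Ratio estimator ITT / pi_hat under the exclusion restriction for Never Takers.

    y: outcomes, t: assignment (0/1), d: treatment received (0/1).
    Assumes one-sided noncompliance, so d is 0 for everyone with t == 0.
    """
    treated = [yi for yi, ti in zip(y, t) if ti == 1]
    control = [yi for yi, ti in zip(y, t) if ti == 0]
    itt = sum(treated) / len(treated) - sum(control) / len(control)
    took_up = [di for di, ti in zip(d, t) if ti == 1]
    pi_hat = sum(took_up) / len(took_up)        # share of Compliers among the treated
    return itt / pi_hat, itt, pi_hat
```

With `y = [3, 3, 1, 1, 1, 2, 1, 2]`, `t = [1, 1, 1, 1, 0, 0, 0, 0]`, and `d = [1, 1, 0, 0, 0, 0, 0, 0]`, the overall ITT is 0.5, the compliance rate is 0.5, and the ratio is 1.0. If the exclusion restriction fails, this ratio absorbs the Never Taker effect, which is the bias concern raised above.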


5.2 Model-based estimation 


In a seminal paper, Imbens and Rubin (1997) outlined a model-based instrumental variables
framework, proposing a parametric model for the outcome distribution conditional on stratum
membership and treatment assignment, such as $f_{ut}(y_i) = N(\mu_{ut}, \sigma^2_{ut})$. While the
exclusion restriction can strengthen inference in this setting, it is not strictly necessary.
Instead, identification is based entirely on standard results for mixture models.

Since Imbens and Rubin (1997), dozens of papers have used finite mixtures for estimating causal
effects.³ For one-sided noncompliance, we can write the observed data likelihood with mean-shifted
standard Normal component distributions as:

$$L_{obs}(\theta) = \prod_{i:\ T_i = 1,\ D_i^{obs} = 1} \pi\,\phi(y_i; \mu_{c1}) \times \prod_{i:\ T_i = 1,\ D_i^{obs} = 0} (1 - \pi)\,\phi(y_i; \mu_{n1}) \times \prod_{i:\ T_i = 0} \left[ \pi\,\phi(y_i; \mu_{c0}) + (1 - \pi)\,\phi(y_i; \mu_{n0}) \right],$$

where $\theta$ represents the vector of parameters and $\phi(y_i; \mu)$ is the Normal density with
mean $\mu$ and variance 1. In practice, we often relax the assumption of known, common variance.
Since the observed data likelihood for individuals with $T_i = 1$ immediately factors into the
likelihood for the Compliers and the likelihood for the Never Takers, we can directly estimate
$\mu_{c1}$ and $\mu_{n1}$. With one-sided noncompliance, we can also directly estimate $\pi$ among
individuals assigned to treatment.

The challenge is therefore to estimate $\mu_{c0}$ and $\mu_{n0}$ via a two-component homoskedastic
Gaussian mixture with known mixing proportion, $\pi$.⁴ See Mattei et al. (2013) for further
discussion of parametric mixture modeling in this setting.
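The structure of this likelihood is easy to see in code. The sketch below evaluates the observed-data log-likelihood for unit-variance Normal components under one-sided noncompliance. It is an illustration rather than the authors' implementation, the function names are hypothetical, and the control-group term is computed naively, so it may underflow for outcomes far from both means.

```python
import math


def log_phi(y, mu):
    """Log density of a Normal with mean mu and variance 1."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (y - mu) ** 2


def obs_loglik(y, t, d, pi, mu_c1, mu_n1, mu_c0, mu_n0):
    """Observed-data log-likelihood: two labeled groups under treatment, a mixture under control."""
    total = 0.0
    for yi, ti, di in zip(y, t, d):
        if ti == 1 and di == 1:            # observed Complier
            total += math.log(pi) + log_phi(yi, mu_c1)
        elif ti == 1:                      # observed Never Taker
            total += math.log(1.0 - pi) + log_phi(yi, mu_n1)
        else:                              # control: stratum unobserved
            total += math.log(pi * math.exp(log_phi(yi, mu_c0))
                              + (1.0 - pi) * math.exp(log_phi(yi, mu_n0)))
    return total
```

Only the last branch involves $\mu_{c0}$ and $\mu_{n0}$, which is why estimating them amounts to fitting a two-component mixture to the control group alone.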


³Some examples of other relevant papers are Little and Yau (1998); Hirano et al. (2000); Barnard et al. (2003);
Ten Have et al. (2004); Gallop et al. (2009); Zhang et al. (2009); Elliott et al. (2010); Zigler and Belin (2011); Frumento
et al. (2012); Page (2012); Schochet (2013).

⁴Note that there is a very small amount of information about $\pi$ from the mixture model among those assigned to
the control group. Given the other complications that arise in mixture modeling, we ignore this and regard $\pi$ as if it
were estimated directly from the treatment group.
5.3 Application to JOBS II


We now turn to using the non-asymptotic results in Section 4 for estimation and diagnostics for
JOBS II. We focus on a subset of $N = 410$ high risk individuals, with $N_1 = 278$ randomly assigned
to treatment and $N_0 = 132$ to control. The finite mixture consists of the $N_0 = 132$ individuals
assigned to control with mixing proportion $\pi = 0.45$.

Table 1 shows summary statistics for the three observed groups. We standardize the outcome
by subtracting off the grand mean and dividing by
$\hat{\sigma} = \sqrt{\pi\hat{\sigma}^2_{c1} + (1 - \pi)\hat{\sigma}^2_{n1}}$, the estimated
within-component standard deviation under treatment. Based on the group means, it is clear that
workers who are observed to enroll in the program have lower depression, on average, than those who
do not. Note that the point estimates for $\hat{\sigma}_{c1}$ and $\hat{\sigma}_{n1}$ are quite
close, which is consistent with the equal variance assumption.

First, we consider the expected performance of the mixture MLE based solely on the observed
sample size and mixing proportion. Figure 5a gives the probability of pile up and sign error over a
range of plausible values of $\Delta$ using the Normal approximation in Equation (4.5) and the
observed JOBS II values of $N = 132$ and $\pi = 0.45$. The pattern is striking. For values of
$\Delta < 0.5$, the most likely value of the MLE is zero, regardless of the true value of $\Delta$.
If the MLE is non-zero, the probability of correctly estimating the sign of $\Delta$ is only slightly
better than a coin flip.

Second, we incorporate information from the mixture distribution itself. The observed
second and third moments are $\hat{m}_2 = 0.96$ and $\hat{m}_3 = 0.17$ (after centering the mixture
distribution). When we plug the observed values into the Normal approximations in Equation (4.5),
the probability of pile up is 0.63 and the probability of a sign error is 0.31. The corresponding
probabilities based on the case-resampling bootstrap are nearly identical, 0.64 and 0.29
respectively. Thus, prior to any estimation, we believe that the probability of a pathological MLE
is high.

Figure 5c shows the observed likelihood surface for Equation (1.1) fit to the JOBS II data.
The likelihood is unimodal and centered at zero, which is consistent with the univariate results
in Mattei et al. (2013).⁵ Given the high probability of pile up ex ante, our analysis suggests that
we should interpret the MLE of $\hat{\Delta}^{mle} = 0$ with caution.


5.4 Application to Job Corps 


In the supplementary materials, we provide a detailed re-analysis of a randomized evaluation of
Job Corps, the largest job training program in the US (Schochet et al., 2008). Following Lee (2009)
and Zhang et al. (2009), we are interested in the impact of Job Corps on (log) hourly wages, which
is a measure of job quality. This quantity, however, is only well defined for a certain
sub-population, known as always employed individuals. This is a principal causal effect and is
sometimes referred to as the Survivor Average Causal Effect (SACE). While more complicated than
non-compliance in JOBS II, we can again formulate the question as estimating the component means in
a Normal finite mixture model. We focus on a mixture of $N = 3,371$ individuals with $\pi = 0.06$.
Thus, while the mixing proportion is relatively extreme, the sample size is considerable.

Despite the large sample size, we continue to find pathological estimates from the Normal
mixture model. First, based on the diagnostics we propose above, the probability of pile up is
around one-third, which is surprising given the large sample size. Rather than find that
$\hat{\Delta}^{mle} = 0$, however, we estimate an implausibly large $\hat{\Delta}^{mle} = -4.5$
standard deviations. This estimate is well outside the minimax bounds, $\Delta \in [-2.4, 2.2]$,
suggesting that bias might be substantial.⁶ See the supplementary materials for additional analysis.
In practice, the simplest explanation for these results is that the simple Normal mixture model in
Equation (1.1) is a poor fit to the data. At the same time, it is difficult to imagine a different
parametric mixture model that would be a better fit. This suggests that parametric finite mixtures
might not be an effective strategy here.

Figure 5: Quality of Maximum Likelihood Estimation for the finite mixture model in JOBS II, with
parameters $N = 132$ and $\pi = 0.45$. Panels (a) and (b) show the probability of MLE pathology
and expected bias of the MLE if non-zero; Panel (c) shows the observed likelihood for the JOBS II
mixture, with a maximum at $\mu_{c0} = \mu_{n0}$.

⁵We can see this using the summary statistics in Mattei et al. (2013). For the univariate model without the
exclusion restriction, their Table 1 gives point estimates $\hat{\mu}_{c1} = 1.96$ and $\hat{\mu}_{n1} = 2.08$ on the
depression scale. The treatment effect point estimates are $\widehat{ITT}_c = -0.206$ and $\widehat{ITT}_n = -0.084$,
which imply $\hat{\mu}_{c0} = 1.96 + 0.206 = 2.166$ and $\hat{\mu}_{n0} = 2.08 + 0.084 = 2.164$. Therefore,
$\hat{\Delta}^{mle} \approx 0$. By contrast, the implied estimate for $\Delta$ from their bivariate model is
$\hat{\Delta} = 0.261$, which is roughly three-quarters of a standard deviation on the depression scale. Finally, note
that the model in Mattei et al. (2013) assumes unknown, unequal variances.

⁶Following Lee (2009), we calculate minimax bounds via trimmed means of the mixture distribution. Specifically,
we bound $\mu_{EN,1}$ via the mean of the $\pi = 0.06$ fraction of individuals with, respectively, the lowest and highest
values of hourly wages, with similar bounds for $\mu_{EE,1}$.


6 Discussion 


We find that maximum likelihood estimates for component-specific means in finite mixtures can
yield pathological results in a range of practical settings. These pathologies are particularly relevant
for estimating causal effects in principal stratification models, which are often based on estimates
of component means. Echoing previous work (e.g., Griffin et al., 2008), we therefore caution
researchers on the use and interpretation of model-based estimates of component-specific parameters,
especially for causal inference.

First, we suggest that, whenever possible, researchers consider alternative approaches to
inference that do not rely on model-based estimation. In the context of principal stratification,
these alternatives often rely on constant treatment effect assumptions or on conditional
independence across multiple outcomes (e.g., Jo, 2002; Jo and Stuart, 2009; Ding et al., 2011). When
such restrictions are not possible, we recommend that researchers first compute nonparametric bounds
(see Zhang and Rubin, 2003; Grilli and Mealli, 2008; Lee, 2009; Miratrix et al., 2018).

Second, researchers might nonetheless be interested in leveraging parametric assumptions for
estimation. In this case, we suggest that researchers use our results to assess the probability of
pathological results for different parameter values. Similar to design analysis, these calculations can
provide practical guidance on whether mixture modeling will yield useful inference. One possibility
is to incorporate multiple outcomes, such as in Mattei et al. (2013). This can greatly improve
inference; intuitively, the distance between components will be greater in multivariate space, in
effect giving larger $\Delta$ and easier separation (see also Mercatanti et al., 2015).

Third, we have focused on maximum likelihood rather than Bayesian methods (Frühwirth-Schnatter,
2006). The Bayesian approach offers some distinct advantages over likelihood-based
inference.⁷ For example, the Bayesian can incorporate informative prior information, which can
be especially important in finite mixture modeling; see, for example, Aitkin and Rubin (1985);
Hirano et al. (2000); Chung et al. (2004); Lee et al. (2009); Gelman (2010). Moreover, our concern
about sign error is trivial in the Bayesian setting: the global mode is simply a poor summary of a
multi-modal posterior. More broadly, the weak identification issues we highlight in this paper are
not necessarily relevant to a strict Bayesian. Imbens and Rubin (1997) and Mattei et al. (2013),
for example, characterize weak identification as substantial regions of flatness in the posterior,
which increases uncertainty but does not lead to any fundamental challenges.⁸ Nonetheless, we
argue that our results are highly relevant for Bayesians who are also interested in good frequency
properties (Rubin, 1984). In the supplementary materials, we offer evidence that the pathological
behaviors we document for the MLE also hold for the posterior mean and median with some "default"
prior values. In this sense, we conduct a frequentist evaluation of a Bayesian procedure (e.g.,
Rubin, 2004) and find poor frequency properties overall. More generally, we agree that informative
prior information can be a powerful tool for improving inference in this setting. Finding suitable
priors for finite mixture models is a topic for future research.

Going forward, we hope that the approach outlined here can serve as a useful template for
studying the behavior of mixture model estimates in finite samples. Moreover, we considered only
a very simple case in this paper; in the future, we plan to assess inference for much richer models,
especially those common in principal stratification. Finally, we are actively exploring alternative
estimation strategies, particularly those that more directly leverage Bayesian methods and that can
give sensible point estimates. In the end, inference in the Twilight Zone is possible. But we must
proceed with caution.

⁷The Bayesian approach also introduces some unique challenges that we do not address here, namely the
label-switching problem (Celeux et al., 2000; Jasra et al., 2005) and the difficulty of specifying vague prior
distributions for finite mixtures (Grazian and Robert, 2015).

⁸Imbens and Rubin (1997) note that "issues of identification [in the Bayesian perspective] are quite different from
those in the frequentist perspective because with proper prior distributions, posterior distributions are always proper.
The effect of adding or dropping assumptions is directly addressed in the phenomenological Bayesian approach by
examining how the posterior predictive distributions for causal estimands change."
Supplementary Materials for "Weak separation in mixture models
and implications for principal stratification"


A Robust estimation via method of moments 


Rather than use higher order moments as diagnostics, we can instead use the method of moments
directly for estimation. Several recent papers have highlighted the attractive properties of method
of moments estimators for general mixture models (Anandkumar et al., 2012; Wu and Yang, 2018).
Applying these results, we show that the method of moments approach has similar asymptotic
properties to the MLE but better finite sample properties; in particular, the method of moments
is not susceptible to pile up.

First, in the setting with known, equal variances in Equation (2.1), we have the following
moment equations:

$$m_1 = E[Y] = \mu$$

$$m_2 = E[Y^2] = 1 + c\delta^2 \qquad (A.1)$$

$$m_3 = E[Y^3] = c\delta^3\,\frac{1 - 2\pi}{1 - \pi},$$

where $\Delta = (1 + c)\delta$. Since there is no information in the first moment about $\delta$, we
consider two estimators based on the second and third moments:⁹

$$|\hat{\delta}_{m_2}| = \left| \frac{\hat{m}_2 - 1}{c} \right|^{1/2}, \qquad \hat{\delta}_{m_3} = \left( \frac{1 - \pi}{c(1 - 2\pi)}\,\hat{m}_3 \right)^{1/3},$$

where $\hat{m}_2$ and $\hat{m}_3$ are the sample second and third (non-central) moments,
respectively. First, the absolute value for $|\hat{\delta}_{m_2}|$ is necessary because there is no
information about the sign of $\delta$ in the second moment. Thus, $\hat{\delta}_{m_2}$ is a natural
estimator when $\pi = 1/2$. By contrast, when $\pi \in (0, 1/2)$, $\hat{\delta}_{m_3}$ will
estimate both the magnitude and sign of $\delta$.
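A minimal sketch of these two estimators follows, written on the $\Delta$ scale so that it relies only on the moment equations $m_2 = 1 + \pi(1 - \pi)\Delta^2$ and $m_3 = \pi(1 - \pi)(1 - 2\pi)\Delta^3$ from Section 4.2. It is illustrative only: the function name is hypothetical, the outcome is assumed to be standardized to unit within-component variance, and centering at the sample mean is our choice for data that have not already been centered.

```python
import math


def mom_delta(y, pi):
    """Method of moments estimates of |Delta| (second moment) and Delta (third moment)."""
    n = len(y)
    ybar = sum(y) / n
    m2_hat = sum((v - ybar) ** 2 for v in y) / n
    m3_hat = sum((v - ybar) ** 3 for v in y) / n
    p = pi * (1.0 - pi)
    # Magnitude only: the second moment carries no information about the sign.
    abs_delta_m2 = math.sqrt(abs(m2_hat - 1.0) / p)
    if pi == 0.5:
        delta_m3 = None                      # third-moment estimator undefined at pi = 1/2
    else:
        x = m3_hat / (p * (1.0 - 2.0 * pi))
        delta_m3 = math.copysign(abs(x) ** (1.0 / 3.0), x)   # real cube root
    return abs_delta_m2, delta_m3
```

Neither estimate is forced to zero by construction, although the square and cube roots make both very noisy when $\Delta$ is near zero.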

The following result establishes that these estimators have asymptotic behavior similar to the
MLE, as described in Theorem 2.1.

Proposition 2. Given the formulations of estimators $\hat{\delta}_{m_2}$ and $\hat{\delta}_{m_3}$
for the setting of known equal variances (2.1), the following holds:

(a) (Asymmetric regime) When $\pi \in (0, 1/2)$, then

$$\sup_{\delta_n \in \Theta} \left|\, |\hat{\delta}_{m_2}| - |\delta_n| \,\right| = O_p\left(n^{-1/4}\right) \qquad (A.2)$$

$$\sup_{\delta_n \in \Theta} \left| \hat{\delta}_{m_3} - \delta_n \right| = O_p\left(n^{-1/6}\right). \qquad (A.3)$$

(b) (Symmetric regime) When $\pi = 1/2$, then

$$\sup_{\delta_n \in \Theta} \left|\, |\hat{\delta}_{m_2}| - |\delta_n| \,\right| = O_p\left(n^{-1/4}\right), \qquad (A.4)$$

where $\hat{\delta}_{m_3}$ is undefined when $\pi = 1/2$.

While these simple estimators have the same asymptotic behavior as the MLE, neither
$\hat{\delta}_{m_2}$ nor $\hat{\delta}_{m_3}$ is susceptible to pile up. This suggests that, in the
simple setting of known equal variances, the moment estimators are more robust than the MLE.

⁹In principle, we could also consider a generalized method of moments estimator based on both the second and
third moments, though this is less transparent than the estimators we discuss below. See Anandkumar et al. (2012);
Hardt and Price (2015); Wu and Yang (2018).

Table 2: Summary statistics for observed groups in Job Corps

Z   S^obs   Observed Mean   Observed SD   Possible Principal Strata
1   1        0.03           1.013         EE and EN
1   0        —              —             NN
0   1       -0.05           1             EE
0   0        —              —             NN and EN


B Analysis of Job Corps


B.1 Setup. 


Following Zhang et al. (2009), we use the principal stratification framework to define the impact of
Job Corps on hourly wages. Let $S$ be an indicator for employment, with corresponding potential
outcomes $S_i(0)$ and $S_i(1)$ and observed employment status $S_i^{obs}$ for individual $i$. We
then define principal strata, $U_i$, based on the joint distribution, $\{S_i(0), S_i(1)\}$:

$$U_i = \begin{cases} EE & \text{if } S_i(1) = 1,\ S_i(0) = 1 \\ EN & \text{if } S_i(1) = 1,\ S_i(0) = 0 \\ NE & \text{if } S_i(1) = 0,\ S_i(0) = 1 \\ NN & \text{if } S_i(1) = 0,\ S_i(0) = 0 \end{cases}.$$

We are interested in the impact of randomization on the always employed stratum, EE. This is
sometimes known as a Survivor Average Causal Effect and is closely related to the idea of
"truncation due to death" (see Zhang et al., 2009). Finally, following Lee (2009), we invoke the
monotonicity assumption, which states that random encouragement to enroll in a job training program
can only increase employment, $S_i(1) \geq S_i(0)$; thus the NE group does not exist.¹⁰

¹⁰While this simplifies the analysis and allows us to highlight the role of finite mixture modeling, Zhang et al.
(2009) argue against this assumption. In particular, they argue that enrolling in a job training program might raise
an individual's reservation wage and, as a result, make that individual less likely to accept a lower paying job. We
merely note that relaxing this assumption further complicates the analysis, since the mixing proportions are no longer
identified non-parametrically.

Table 2 shows the relationship between principal strata and the observed groups, based on $Z$ and
$S^{obs}$. Under monotonicity, we directly observe always employed individuals (EE) assigned to the
control group. We can therefore directly estimate the average outcome for this group,
$\mu_{EE,0}$. We can also directly estimate the proportion of EE individuals via
$\pi_{EE} = P[S_i^{obs} = 1 \mid Z_i = 0]$, the proportion of never employed individuals (NN) via
$\pi_{NN} = 1 - P[S_i^{obs} = 1 \mid Z_i = 1]$, and the proportion of the induced to employment
individuals (EN) via $\pi_{EN} = 1 - \pi_{NN} - \pi_{EE}$. Without additional assumptions, however,
we cannot estimate $\mu_{EE,1}$, instead observing a mixture of EE and EN individuals. Consistent
with Zhang et al. (2009) and Frumento et al. (2012), we therefore assume that log-hourly wages
follow a mixture of Gaussians with known mixing proportion, as in Equation (1.1) in the main
text. Note that this mixture is much simpler than the full model considered in Zhang et al. (2009),
which accounts for some important additional complications.
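The stratum proportions above follow directly from four counts. The sketch below is illustrative: the function name and the counts in the usage note are hypothetical, and the last quantity, the share of EN individuals among employed individuals assigned to treatment, reflects our reading of how the mixing proportion of the mixture is obtained under monotonicity.

```python
def strata_proportions(n_z0, employed_z0, n_z1, employed_z1):
    """Principal strata proportions under monotonicity (no NE group)."""
    pi_ee = employed_z0 / n_z0             # P[S_obs = 1 | Z = 0]
    pi_nn = 1.0 - employed_z1 / n_z1       # 1 - P[S_obs = 1 | Z = 1]
    pi_en = 1.0 - pi_nn - pi_ee
    mixing = pi_en / (pi_en + pi_ee)       # EN share among employed, treated individuals
    return {"EE": pi_ee, "NN": pi_nn, "EN": pi_en, "mixing": mixing}
```

For example, with 100 individuals in each arm, 50 employed under control and 60 employed under treatment, the proportions are 0.5 (EE), 0.4 (NN), and 0.1 (EN), and the EN share among employed treated individuals is 1/6.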


B.2 Diagnostics. 


We focus on a complete case subset used by Lee (2009) of $N = 9,145$ individuals, with
$N_1 = 5,546$ randomly assigned to treatment and $N_0 = 3,599$ to control. The mixture model
consists of the $N_{11} = 3,371$ individuals assigned to treatment who are employed, with mixing
proportion $\pi = 0.06$.

Table 2 shows summary statistics for observable groups. We standardize the outcome by
subtracting off the grand mean and dividing by $\hat{\sigma}_0$, the estimated standard deviation
for individuals assigned to control who are employed. This is also the standard deviation for EE
individuals assigned to control. Since hourly wage is only defined for employed workers, the rows
with $S^{obs} = 0$ have undefined outcomes.

Figure B.6a gives the probability of pile up and sign error over a range of plausible values of
$\Delta$ using the Normal approximation in Equation (4.5) and the observed Job Corps mixture
parameters of $N = 3,371$ and $\pi = 0.06$. As in Figure 5a, pile up is a major concern, though the
probability of a sign error is somewhat less ex ante, in part because the mixing proportion is much
closer to 0. Figure B.6b shows the bias of the MLE if the MLE is non-zero and the sign is correct.
As with JOBS II, the bias can be severe.

We can also incorporate the higher order moments of the mixture distribution. In this case, the
observed second and third moments are $\hat{m}_2 = 1.03$ and $\hat{m}_3 = -0.87$, respectively
(after centering the mixture distribution). Plugging the observed values into the Normal
approximations in Equation (4.5), the pile up probability is 0.34 and the sign error probability is
0.03. The corresponding probabilities based on the case-resampling bootstrap are nearly identical,
0.34 and 0.04 respectively.

Figure B.6c shows the observed likelihood for the mixture model. The MLE is at
$\hat{\mu}^{mle}_{EE,1} = 0.09$ and $\hat{\mu}^{mle}_{EN,1} = -4.40$, which implies
$\hat{\Delta}^{mle} = -4.49$ standard deviations. This is clearly an extreme estimate. Transforming
these estimates to dollars per hour shows that $\hat{\mu}^{mle}_{EE,1}$ = \$8.24 per hour and
$\hat{\mu}^{mle}_{EN,1}$ = \$0.09 per hour, which is far below feasible hourly wages in this sample.
This estimate is also outside the minimax bounds, $\Delta \in [-2.4, 2.2]$.¹¹ There is also a local
mode centered at $\hat{\mu}^{mle}_{EE,1} = -0.01$ and $\hat{\mu}^{mle}_{EN,1} = 0.59$, which implies
$\hat{\Delta}^{mle} = 0.60$ standard deviations. In units of dollars per hour, this is
$\hat{\mu}^{mle}_{EE,1}$ = \$7.47 per hour and $\hat{\mu}^{mle}_{EN,1}$ = \$13.64 per hour. While far
more feasible than the global mode, these estimates are still worrisome, since it is unlikely that
the group induced to employment by Job Corps would have hourly wages nearly twice those of the
always employed group; see Figure B.6b. Regardless, the likelihood at the MLE is considerably higher
than at the local mode, with $-2 \times (\ell(+0.60 \mid Y) - \ell(-4.49 \mid Y)) = 296$. Taken
together, these results suggest that maximum likelihood does not give practically useful results in
this example.

¹¹Following Lee (2009), we calculate minimax bounds via trimmed means of the mixture distribution. Specifically,
we bound $\mu_{EN,1}$ via the mean of the $\pi = 0.06$ fraction of individuals with, respectively, the lowest and highest
values of hourly wages, with similar bounds for $\mu_{EE,1}$.

Figure B.6: Quality of Maximum Likelihood Estimation for the finite mixture model in Job Corps,
with parameters $N = 3,371$ and $\pi = 0.06$. Panels (a) and (b) show the probability of MLE
pathology and expected bias of the MLE if non-zero; Panel (c) shows the observed likelihood for
the Job Corps mixture, with a global mode and a local mode. The dotted line denotes equal
component means.
In practice, the simplest explanation for these results is that the simple Normal mixture model
in Equation (1.1) in the main text is a poor fit to the data. At the same time, however, it is
difficult to imagine a more plausible parametric mixture model in this setting. Thus parametric
finite mixtures might not be an effective strategy in this example.


C Validating the Normal approximations 


We present figures testing the agreement of the moment-based Normal approximations with their
corresponding pathologies assessed via simulation. Figure C.7 compares the incidence of pile up and
$m_2 < 1$ for a range of values of $\pi$, $\Delta$, and $N$. The blue line indicates the probability that the
method-of-moments indicator of pile up, $1\{m_2 < 1\}$, agrees with whether or not pile up was
observed in simulation. The results are averaged over 1000 simulated data sets. Unsurprisingly,
the correspondence improves as $N$ increases and is worst when $\pi = 0.1$, the case in which the
mixture is most asymmetric. Overall, however, the Normal approximation provides an excellent
estimator for whether pile up has occurred in the sample.

Figure C.8 shows the corresponding plots for assessing the sign of $\Delta$. Here, due to the extra
noise in $m_3$, the correspondence is much less sharp. The discrepancies are most noticeable when $\pi$
is close to 0 and $\Delta$ is small.
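The two moment-based diagnostics can be sketched in a few lines. This is only an illustration, assuming the data are scaled so that the within-component variance is known (the `sigma2` argument, set to 1 by default); the function names are hypothetical and not taken from the paper.

```python
def central_moments(y):
    """Return the sample mean and the second and third central moments of y."""
    n = len(y)
    mean = sum(y) / n
    m2 = sum((v - mean) ** 2 for v in y) / n
    m3 = sum((v - mean) ** 3 for v in y) / n
    return mean, m2, m3


def moment_diagnostics(y, sigma2=1.0):
    """Flag likely pile up (m2 below the within-component variance) and report the sign of m3."""
    _, m2, m3 = central_moments(y)
    return {
        "m2": m2,
        "m3": m3,
        "pile_up": m2 < sigma2,            # the indicator 1{m2 < 1} when sigma2 = 1
        "sign_m3": (m3 > 0) - (m3 < 0),    # -1, 0, or +1
    }
```

A sample whose second central moment falls below `sigma2` is flagged as a likely pile-up case; the sign of `m3` is the noisier of the two signals, in line with the discussion of Figure C.8.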


D_ Confidence sets via inverting tests 


Given the poor performance of the MLE, we are interested in methods that perform well even
when $\Delta$ is small. Based on the large literature on weak identification in other settings, we presume
that many such methods are possible. As a starting point, we suggest an approach to construct
confidence intervals based on inverting a sequence of tests. This approach is widely used in other
weak identification settings, namely weak instruments (Staiger and Stock, 1997; Kang et al., 2015)
and the unit root moving average problem (Mikusheva, 2007). It is also closely related to the
method of constructing confidence intervals for causal effects by inverting a sequence of Fisher
Randomization Tests (Rosenbaum, 2002).

At the same time, this approach has its drawbacks. First, while test inversion yields confidence
sets with good coverage properties, it does not necessarily yield good point estimates. In particular,
it is possible to construct a Hodges-Lehmann-style estimator via the point on the grid with the
highest p-value (Hodges and Lehmann, 1963). But since pile up and sign error remain issues, any
point estimator in this case should be interpreted with caution. Second, the coverage guarantees
hold only when the model is correctly specified; under even moderate mis-specification, the resulting
estimator can cease to exist (Gelman, 2011). Importantly, the MLE performs poorly even when the
model is correctly specified. Alternatively, researchers uninterested in test inversion for confidence
intervals might nonetheless be interested in using this approach to assess model fit. If the proposed
procedure rejects everywhere, this is evidence that the Normal mixture model is a poor fit.

We discuss two basic approaches here. Our first approach is a version of the grid bootstrap
of Andrews (1993) and Hansen (1999), which generates Monte Carlo p-values by simulating fake
data sets from the null hypothesis. While the grid bootstrap is conceptually straightforward and
enjoys theoretical guarantees (Mikusheva, 2007), it is also computationally intensive. Our second
approach is therefore a fast approximation that directly uses the Normal sampling distribution
in Equation (4.5) of the main text to derive a $\chi^2$ test at each grid point. To demonstrate these


[Figure panels: $\Delta \in \{0.2, 0.4, 0.6, 0.8, 1\}$ across columns and $\pi \in \{0.1, 0.2, 0.3, 0.4\}$ across rows; horizontal axis $N$ (100 to 400), vertical axis $p$.]

Figure C.7: Probability that the diagnostic based on the second moment ($1\{m_2 < 1\}$) agrees with
whether or not pile up was observed in simulation. The dotted red line marks perfect correspondence at
each tested $N$. The blue line is the average agreement probability over 1000 simulated data sets.
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Figure C.8: Probability that the diagnostic based on the third moment agrees with whether or not 
the wrong sign pathology was observed in simulation. The dotted red line marks perfect correspondence
at each tested $N$. The blue line is the average agreement probability over 1000 simulated data sets.
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Figure C.9: Berry-Esseen bound for probability of pile up for $\pi = 0.35$ and a range of values of $N$
and $\Delta$.


methods, we first outline inference for $\Delta$ alone and then extend this to inference for the component-specific
means, $\mu_0$ and $\mu_1$.


D.1 Overview of grid bootstrap 


To conduct a grid bootstrap, we first need a grid. Define $\boldsymbol{\Delta} = \{\Delta_0, \Delta_1, \ldots, \Delta_n\}$ with $\Delta_i > \Delta_j$ for
$i > j$. The immediate goal is then to obtain a p-value for the following null hypotheses for each
value $\Delta_j \in \boldsymbol{\Delta}$:

$$H_0 : \Delta = \Delta_j \quad \text{vs.} \quad H_1 : \Delta \neq \Delta_j. \tag{D.1}$$

For convenience we first center the data (i.e., we set $\mu = 0$ as in the main text). Next, we need a
test statistic, $t(y, \Delta_j)$, that is a function of the observed (or simulated) data and the value of $\Delta$
under the null hypothesis, $\Delta = \Delta_j$. For a given $N$, and initially assuming $\pi$ and $\sigma^2$ are known, we
then obtain exact p-values through simulation with the following procedure:

• For each $\Delta_j \in \boldsymbol{\Delta}$:

  - Calculate the observed test statistic, $t^{obs} = t(y^{obs}, \Delta_j)$.

  - Generate $B$ data sets of size $N$ from the model
    $$y_i^{*} \overset{iid}{\sim} \pi N\left(\frac{\Delta_j}{2}, \sigma^2\right) + (1 - \pi) N\left(-\frac{\Delta_j}{2}, \sigma^2\right).$$

  - For each simulated $y_b^{*}$, compute $t_b^{*} = t(y_b^{*}, \Delta_j)$.

  - Calculate the empirical p-value of $t^{obs}$ as a function of the null distribution, $t^{*}$.


• Calculate the confidence set, $CS_\alpha(\Delta) = \{\Delta_j : p(\Delta_j) > \alpha\}$ for a specified significance level
$\alpha$, where $p(\Delta_j)$ is the empirical p-value of $\Delta^{mle}$ assuming that $\Delta = \Delta_j$.


Note that the resulting confidence set might not be continuous, which could occur if the sampling 
distribution is strongly bimodal. 
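The procedure can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes known $\pi$ and $\sigma$, places the component means at $\pm\Delta_j/2$, and uses a hypothetical test statistic (`variance_gap`, the gap between the sample variance and the mixture variance implied by $\Delta_j$). The confidence set keeps the grid points whose p-value exceeds $\alpha$, the usual test-inversion rule.

```python
import random


def simulate_mixture(n, delta, pi, sigma, rng):
    """Draw n values from pi*N(delta/2, sigma^2) + (1-pi)*N(-delta/2, sigma^2)."""
    draws = []
    for _ in range(n):
        mean = delta / 2.0 if rng.random() < pi else -delta / 2.0
        draws.append(rng.gauss(mean, sigma))
    return draws


def variance_gap(y, delta, pi, sigma):
    """Hypothetical test statistic: |sample variance - mixture variance implied by delta|."""
    mean = sum(y) / len(y)
    m2 = sum((v - mean) ** 2 for v in y) / len(y)
    return abs(m2 - (sigma ** 2 + pi * (1.0 - pi) * delta ** 2))


def grid_bootstrap(y, grid, pi, sigma, alpha=0.05, B=200, seed=0):
    """Return Monte Carlo p-values on the grid and the grid points that are not rejected."""
    rng = random.Random(seed)
    p_values = {}
    for delta_j in grid:
        t_obs = variance_gap(y, delta_j, pi, sigma)
        t_sim = [
            variance_gap(simulate_mixture(len(y), delta_j, pi, sigma, rng), delta_j, pi, sigma)
            for _ in range(B)
        ]
        # Add-one correction keeps the Monte Carlo p-value strictly positive.
        p_values[delta_j] = (1 + sum(t >= t_obs for t in t_sim)) / (B + 1)
    confidence_set = [d for d in grid if p_values[d] > alpha]
    return p_values, confidence_set
```

Because the set is built point by point on the grid, it can come out as several separate pieces, which is the behavior noted above for strongly bimodal sampling distributions.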


D.2 Constructing a test statistic 


So long as the model is correctly specified, this approach yields an exact p-value for any valid test
statistic, up to Monte Carlo error (Mikusheva, 2007). We propose a test statistic based on the
joint distribution of $m_2$ and $m_3$.$^{12}$ Equation (4.5) suggests a natural combination of the estimated
cumulants:

$$t(y, \Delta_j) = (d_2, d_3)\, \mathrm{Var}(m_2, m_3)^{-1} (d_2, d_3)^T, \tag{D.2}$$

where $d_k = m_k - \bar{m}_k$, and we use the assumed null of $\Delta = \Delta_j$ to obtain $(\bar{m}_2, \bar{m}_3)$ and $\mathrm{Var}(m_2, m_3)$.
As we saw, the Normal approximation in Equation (4.5) in the main text is excellent, even for
modest sample sizes (say $N > 100$). This implies:

$$t(y, \Delta_j) \overset{\cdot}{\sim} \chi^2_2.$$

We can therefore obtain a p-value via a Wald test, rather than via simulation, at each grid point,
which is much faster computationally.
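A small sketch of the Wald computation at a single grid point follows. The null means and covariance of $(m_2, m_3)$ come from Equation (4.5) of the main text, which is not reproduced here, so the sketch takes them as inputs (`m_null`, `cov_null`, both hypothetical argument names). It uses the fact that the upper tail of a $\chi^2$ distribution with 2 degrees of freedom is $\exp(-t/2)$.

```python
import math


def wald_statistic(d, cov):
    """Quadratic form d' cov^{-1} d for a 2-vector d and a 2x2 covariance matrix."""
    (a, b), (c, e) = cov
    det = a * e - b * c
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    d2, d3 = d
    # The inverse of [[a, b], [c, e]] is [[e, -b], [-c, a]] / det.
    return (e * d2 * d2 - (b + c) * d2 * d3 + a * d3 * d3) / det


def chi2_2df_pvalue(t):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-t / 2.0)


def wald_pvalue(m_obs, m_null, cov_null):
    """P-value for observed (m2, m3) against their null means and null covariance."""
    d = (m_obs[0] - m_null[0], m_obs[1] - m_null[1])
    return chi2_2df_pvalue(wald_statistic(d, cov_null))
```

Evaluating `wald_pvalue` at every grid point replaces the inner simulation loop of the grid bootstrap with a closed-form calculation.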

Finally, to use these approaches to estimate component means, we need to (1) expand the grid,
and (2) expand the test statistic. A natural choice for a grid of points is the two-dimensional
grid over $\mu_0$ and $\mu_1$. To expand the test statistic, we directly use the first three cumulants from
Equation (4.5) from the main text and from Tan and Chang (1972) to obtain a joint test statistic
as in Equation (D.2):

$$t_m(y, \Delta_j) = (d_1, d_2, d_3)\, \mathrm{Var}(\kappa_1, \kappa_2, \kappa_3)^{-1} (d_1, d_2, d_3)^T \sim \chi^2_3. \tag{D.3}$$

As above, we can obtain p-values via the grid bootstrap rather than via the $\chi^2$ distribution.
Figure D.10 shows the distribution of p-values for three different examples from the same data 
generating process, with N = 1000, 7 = 0.325, o? = 1, Lo = +4, MW = —%.18 

Figure D.11 shows the 95% coverage for the confidence sets obtained through this fast ap- 
proximation. As expected, the coverage is essentially exact. In particular, 95% coverage for this 
procedure is far better than the corresponding coverage based on the MLE. 


D.3 Grid bootstrap for principal stratification model 


In the full principal stratification model, we directly estimate the outcome means for Compliers
and Never Takers assigned to treatment, $\mu_{c1}$ and $\mu_{n1}$, and use the finite mixture model to estimate


"There are many possible alternatives. For example, Frumento et al. (2016) suggest test statistics based on scaled 
log-likelihood ratios. Another option is to use univariate test statistics based on m2 or m3. 

$^{13}$Note that the $\chi^2$ distribution no longer holds when $\mu_0 = \mu_1$. While we can use a univariate Normal distribution
to obtain a valid p-value in this case, this additional complication is generally unnecessary in practice.
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Figure D.10: Three examples of the grid of Wald test p-values from Equation D.3. The three 
simulated data sets were drawn from Equation (1.1) in the main text with $N = 1000$, $\pi = 0.325$,
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Figure D.11: Coverage for 95% confidence sets based on the test inversion algorithm described in
Section D. The results for the MLE are for the standard finite mixtures estimator.
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corresponding outcome means for Compliers and Never Takers assigned to control, $\mu_{c0}$ and $\mu_{n0}$.
Our goal is inference for $ITT_c = \mu_{c1} - \mu_{c0}$ and $ITT_n = \mu_{n1} - \mu_{n0}$. While this is straightforward
given estimates for $\mu_{c0}$ and $\mu_{n0}$, we only have confidence sets for these means.

We therefore propose the following approach to obtaining $(1 - \alpha)100\%$ confidence sets for $ITT_c$
and $ITT_n$:

• Use a grid bootstrap or test inversion to obtain a joint $(1 - \alpha/2)100\%$ confidence set for $\mu_{c0}$
and $\mu_{n0}$, which we can project into univariate confidence sets, $CS_{\alpha/2}(\mu_{c0})$ and $CS_{\alpha/2}(\mu_{n0})$.

• Directly obtain $(1 - \alpha/2)100\%$ confidence intervals via the Normal distribution for $\mu_{c1}$ and
$\mu_{n1}$: $CS_{\alpha/2}(\mu_{c1})$ and $CS_{\alpha/2}(\mu_{n1})$.

• For $ITT_c$ (repeat for $ITT_n$):

  - If $CS_{\alpha/2}(\mu_{c0})$ is not disjoint, obtain a $(1 - \alpha)100\%$ confidence interval for $ITT_c$:

    $$CS_{\alpha}^{low}(ITT_c) = CS_{\alpha/2}^{low}(\mu_{c1}) - CS_{\alpha/2}^{up}(\mu_{c0})$$
    $$CS_{\alpha}^{up}(ITT_c) = CS_{\alpha/2}^{up}(\mu_{c1}) - CS_{\alpha/2}^{low}(\mu_{c0})$$

  - If $CS_{\alpha/2}(\mu_{c0})$ is disjoint, repeat the above calculations for each separate segment and
    then take the union.

This yields valid confidence sets for both treatment effects of interest. If desired, we could
incorporate an additional Bonferroni correction to account for the two separate intervals.
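The combination step amounts to interval arithmetic for a difference. The sketch below assumes the standard rule (lower end of the treated interval minus upper end of the control segment, and vice versa); the function name and the representation of a disjoint set as a list of `(low, high)` segments are illustrative choices, not the paper's.

```python
def itt_confidence_set(cs_treated, cs_control_segments):
    """Combine an interval for the treated mean with a (possibly disjoint) set for the control mean.

    cs_treated: (low, high) interval for the treated-arm mean.
    cs_control_segments: list of (low, high) segments for the control-arm mean.
    Returns the union of the implied intervals for the difference, as a sorted list.
    """
    low1, high1 = cs_treated
    pieces = []
    for low0, high0 in cs_control_segments:
        # Smallest difference pairs the treated lower end with the control upper end.
        pieces.append((low1 - high0, high1 - low0))
    return sorted(pieces)
```

For example, `itt_confidence_set((1.0, 2.0), [(0.0, 0.5), (3.0, 4.0)])` yields one interval per control segment, and the reported set is their union.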

Finally, if desired, we can extend this procedure to account for uncertainty in $\pi$ and $\sigma^2$, which
are nuisance parameters for the desired hypothesis tests. We can therefore use results from Berger
and Boos (1994) to obtain valid p-values in this context. First, we obtain a $(1 - \gamma)$-level joint
confidence set for $(\pi, \sigma^2)$, denoted $CS_\gamma(\pi, \sigma^2)$, such as via case-resampling bootstrap, with $\gamma$ very small, such as
$\gamma = 0.001$. We obtain a valid p-value for, say, $\Delta_0$, by taking the maximum p-value over $CS_\gamma(\pi, \sigma^2)$
plus a correction for the added uncertainty:

$$p_\gamma(\Delta_0) = \sup_{(\pi, \sigma^2) \in CS_\gamma(\pi, \sigma^2)} p(\Delta_0) + \gamma.$$

See Nolen and Hudgens (2011) and Ding et al. (2016) for further discussion of the validity of this
approach.
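In code, the adjustment is a maximum over a finite set of nuisance-parameter values plus $\gamma$. The sketch below assumes the p-values have already been computed on a grid of $(\pi, \sigma^2)$ values inside the nuisance confidence set; truncating the result at 1 is a convenience added here, not part of the formula above.

```python
def berger_boos_pvalue(p_values_over_nuisance_set, gamma=0.001):
    """Maximum p-value over the nuisance confidence set, plus gamma, truncated at 1."""
    if not p_values_over_nuisance_set:
        raise ValueError("need at least one p-value from the nuisance confidence set")
    return min(1.0, max(p_values_over_nuisance_set) + gamma)
```

A finite grid only approximates the supremum, so a finer grid over the nuisance set gives a more conservative (larger) p-value.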


E Failure of resampling methods 


Resampling methods, such as the case-resampling bootstrap, are common in finite mixture model
settings. For example, McLachlan and Peel (2004) recommend using the bootstrap to improve
estimation of standard errors when the Fisher information yields a poor approximation (Grün and
Leisch, 2004). Others have suggested subsampling in similar settings (Andrews, 2000). Figure E.12
shows the coverage for 95% confidence sets based on the case-resampling and subsampling intervals.
Clearly, the coverage is far from nominal.

The form of $\Delta^{mom}$ shows why the performance of these methods is so poor. As Bickel
and Freedman (1981) proved, for the bootstrap to be consistent in the iid context, the mapping
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Figure E.12: Coverage probabilities for 95% confidence sets based on the case-resampling and
subsampling intervals. The blue line represents the case-resampling coverage probability, while the
blue line represents the subsampling coverage probability.
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from the underlying distribution of the data to the distribution of the statistic must be continuous
(Andrews, 2000). Clearly,

$$\Delta^{mom} = \mathrm{sgn}(m_3) \sqrt{\frac{m_2 - 1}{\pi(1 - \pi)}}$$

is not a continuous mapping from the sample to $\Delta^{mom}$, with a boundary at $m_2 > 1$ and a discontinuity
at $m_3 = 0$.$^{14}$ In the related case of the unit root problem, Mikusheva (2007) shows that other
resampling methods also fail, including subsampling and the m of n bootstrap. In the context of
principal stratification, Zhang et al. (2009) note that confidence intervals based on the bootstrap
often fail when the likelihood is multimodal. Frumento et al. (2016) offer additional discussion in
this setting.
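The two sources of non-smoothness are easy to see numerically. The sketch below assumes the moment estimator takes the form $\mathrm{sgn}(m_3)\sqrt{(m_2 - 1)/(\pi(1-\pi))}$ and, as a further assumption made here, sets the estimate to zero when $m_2 \le 1$ (the pile-up case); `delta_mom` is a hypothetical name.

```python
import math


def delta_mom(m2, m3, pi):
    """Moment-based estimate of the separation, set to zero when m2 <= 1 (assumed convention)."""
    if m2 <= 1.0:
        return 0.0
    sign = (m3 > 0) - (m3 < 0)
    return sign * math.sqrt((m2 - 1.0) / (pi * (1.0 - pi)))
```

With `m2 = 1.21` and `pi = 0.5`, moving `m3` from `-1e-9` to `+1e-9` flips the estimate from about -0.92 to about +0.92, so an arbitrarily small perturbation of the sample changes the statistic by a fixed amount.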


F Frequency Performance of the Posterior Mean and Median 


Bayesian inference for finite mixtures introduces some unique challenges for specifying priors (e.g.,
Grazian and Robert, 2015). Nonetheless, inference for a posterior with a sufficiently vague prior
should be broadly similar to inference based on the likelihood alone. Thus, without an informative
prior for $\{\mu_0, \mu_1\}$ in the two-component Gaussian mixture, the posterior mean and median should
exhibit similar pathologies to those exhibited by the MLE. We test this intuition using the bayesm
package in R. Figure F.13 shows histograms of the posterior mean of $\Delta$ when the true $\Delta$ is 0.5
and 1, $\pi = 0.3$, and $N = 100$. We use the default priors of the bayesm package except in the case
of the Dirichlet parameter, which is set to reflect that $\pi = 0.3$ is known (i.e., we assume a very
informative prior). The histograms exhibit the same behavior as the MLE of $\Delta$. In particular, the
estimator concentrates around 0 and seems unable to differentiate between $\Delta > 0$ and $\Delta < 0$.
Figure F.14 shows the corresponding plot for the distribution of the posterior median of $\Delta$. As
we can see, the median also concentrates about 0 and appears unable to determine the sign of $\Delta$.


$^{14}$In some promising recent work, Laber and Murphy (2011) explore bootstrap-type methods with non-continuous
mappings. We hope to explore this more in the future.
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Figure F.13: Histograms of the posterior mean for $\Delta$ calculated via MCMC draws from bayesm.
The histogram on the left is for $\Delta = 0.5$, while the histogram on the right is for $\Delta = 1$. Both
histograms have $N = 100$, $\pi = 0.3$, and $\sigma = 1$.
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Figure F.14: Histograms of the posterior median for $\Delta$ calculated via MCMC draws from bayesm.
The histogram on the left is for $\Delta = 0.5$, while the histogram on the right is for $\Delta = 1$. Both
histograms have $N = 100$, $\pi = 0.3$, and $\sigma = 1$.
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G Proofs 


In this appendix, we provide detailed proofs for the key asymptotic results in Section 2. We first
start with the proof regarding convergence rates of $\hat{\delta}_n^{mle}$ and $|\hat{\delta}_n^{mle}|$ under the asymmetric and
symmetric settings of model (2.1).


G.1 PROOF OF THEOREM 2.1 


Throughout this proof, for ease of presentation, we denote

$$g(x, \delta) = \pi \phi(x, -\delta) + (1 - \pi) \phi(x, c\delta),$$

for any $\delta \in \Theta$, where $\{\phi(x, \theta)\}$ denotes the family of Gaussian distributions with location parameter
$\theta$ and scale fixed to be 1. Additionally, we recall that $c = \pi/(1 - \pi)$, with this quantity
thus being a known constant. To streamline the argument, we divide the proof into two parts. In
Section G.1.1, we provide the proof for the upper bounds of the convergence rate of the MLE. Then,
in Section G.1.2, we present the proof for the lower bounds.


G.1.1 Proof for upper bounds 


The proof technique for the upper bounds utilizes the strategy of comparing the convergence rate
of density estimation to that of parameter estimation in mixture models, which has been employed
successfully in previous work (Chen, 1995; Nguyen, 2013; Ho and Nguyen, 2016; Heinrich
and Kahn, 2018).


Convergence rate of density estimation The convergence rate of density estimation in Gaussian
mixture models has been studied rigorously in the literature (Ghosal and van der Vaart, 2001).
Regarding our model (2.1), we have the following result on the convergence rate of $g(x, \hat{\delta}_n^{mle})$
to $g(x, \delta_n)$ under the Hellinger metric.


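Proposition 3. Under the setting of model (2.1), the following holds

$$\sup_{\delta_n \in \Theta} E_{\delta_n}\left( h\left( g(x, \hat{\delta}_n^{mle}), g(x, \delta_n) \right) \right) \lesssim \left( \frac{\log n}{n} \right)^{1/2},$$

where $\Theta$ is a bounded (growing) parameter space. Here, $E_{\delta_n}$ denotes the expectation taken with
respect to the product measure with mixture density of $Y_1, \ldots, Y_n$ under the model (2.1).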


The proof of the above result follows from a standard application of Theorem 7.4 in van de 
Geer (2000); therefore, it is omitted. 


From density estimation to parameter estimation Equipped with the $(\log n/n)^{1/2}$ rate of density
estimation in Proposition 3, to achieve the convergence rates of $\hat{\delta}_n^{mle}$ and $|\hat{\delta}_n^{mle}|$ under the
asymmetric and symmetric settings of model (2.1), it is sufficient to demonstrate the following result:


Lemma G.1. Given $\pi \in (0, 1/2]$ and $\Theta = [-1, 1]$, the following holds
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(a) (Asymmetric regime) When $\pi \in (0, 1/2)$, then

$$\inf_{\delta^{(1)}, \delta^{(2)} \in \Theta} h\left( g(x, \delta^{(1)}), g(x, \delta^{(2)}) \right) \Big/ \left| \delta^{(1)} - \delta^{(2)} \right|^3 > 0.$$

(b) (Symmetric regime) When $\pi = 1/2$, then

$$\inf_{\delta^{(1)}, \delta^{(2)} \in \Theta} h\left( g(x, \delta^{(1)}), g(x, \delta^{(2)}) \right) \Big/ \left| |\delta^{(1)}| - |\delta^{(2)}| \right|^2 > 0.$$
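To see why a lower bound of this type suffices, the following sketch spells out the reduction. The constants $C_1, C_2 > 0$ are placeholders introduced here for the two infima; they are not quantities defined in the text.

```latex
% Asymmetric regime: part (a) gives, for all \delta^{(1)}, \delta^{(2)} \in \Theta,
%   h(g(x,\delta^{(1)}), g(x,\delta^{(2)})) \ge C_1 |\delta^{(1)} - \delta^{(2)}|^3 .
\begin{align*}
C_1\, E_{\delta_n}\big|\hat{\delta}_n^{mle} - \delta_n\big|^{3}
  &\le E_{\delta_n}\, h\big(g(x,\hat{\delta}_n^{mle}),\, g(x,\delta_n)\big)
   \lesssim \Big(\tfrac{\log n}{n}\Big)^{1/2}
  &&\Longrightarrow\quad
  \big|\hat{\delta}_n^{mle} - \delta_n\big| = O_P\Big(\big(\tfrac{\log n}{n}\big)^{1/6}\Big), \\
% Symmetric regime: part (b) gives the analogous bound with exponent 2 on the absolute values.
C_2\, E_{\delta_n}\Big|\,|\hat{\delta}_n^{mle}| - |\delta_n|\,\Big|^{2}
  &\le E_{\delta_n}\, h\big(g(x,\hat{\delta}_n^{mle}),\, g(x,\delta_n)\big)
   \lesssim \Big(\tfrac{\log n}{n}\Big)^{1/2}
  &&\Longrightarrow\quad
  \Big|\,|\hat{\delta}_n^{mle}| - |\delta_n|\,\Big| = O_P\Big(\big(\tfrac{\log n}{n}\big)^{1/4}\Big).
\end{align*}
```

The exponent in the denominator of each infimum therefore translates directly into the rate: a cubic bound turns the $n^{-1/2}$ density rate into $n^{-1/6}$ (up to logarithmic factors), and a quadratic bound turns it into $n^{-1/4}$.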


Proof. (a) Due to the basic inequality between total variation distance and Hellinger distance,
$h \ge V$, it suffices to prove that

$$\inf_{\delta^{(1)}, \delta^{(2)} \in \Theta} V\left( g(x, \delta^{(1)}), g(x, \delta^{(2)}) \right) \Big/ \left| \delta^{(1)} - \delta^{(2)} \right|^3 > 0. \tag{G.1}$$

Assume that the conclusion of (G.1) does not hold. It implies that we can find two sequences $\{\delta_n^{(1)}\}$
and $\{\delta_n^{(2)}\}$ such that $V\left( g(x, \delta_n^{(1)}), g(x, \delta_n^{(2)}) \right) / |\delta_n^{(1)} - \delta_n^{(2)}|^3 \to 0$ as $n \to \infty$. For simplicity of
presentation, we only consider the most challenging setting of the sequences $\{\delta_n^{(1)}\}$ and $\{\delta_n^{(2)}\}$,
in which $\delta_n^{(1)} \to 0$ and $\delta_n^{(2)} \to 0$ as $n \to \infty$. The proof for other possibilities of these sequences can be
argued in a similar fashion. Now, we have two distinct cases regarding the convergence of $\delta_n^{(1)}$
and $\delta_n^{(2)}$.


Case a.1: $\delta_n^{(1)} / \delta_n^{(2)} \not\to 1$ as $n \to \infty$. (Here, the limit can be thought of as that of some subsequence of
$\delta_n^{(1)} / \delta_n^{(2)}$. However, we replace this subsequence by the whole sequence of $\delta_n^{(1)} / \delta_n^{(2)}$ for simplicity
of presentation.) Under this case, we divide our argument into several steps.


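Step 1 - Taylor expansion Now, the following equality holds

$$\frac{g(x, \delta_n^{(1)}) - g(x, \delta_n^{(2)})}{|\delta_n^{(1)} - \delta_n^{(2)}|^3} = \frac{\pi \left( \phi(x, -\delta_n^{(1)}) - \phi(x, -\delta_n^{(2)}) \right)}{|\delta_n^{(1)} - \delta_n^{(2)}|^3} + \frac{(1 - \pi) \left( \phi(x, c\delta_n^{(1)}) - \phi(x, c\delta_n^{(2)}) \right)}{|\delta_n^{(1)} - \delta_n^{(2)}|^3}.$$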
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Invoking Taylor expansion up to the third order, we obtain that 


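$$\phi(x, -\delta_n^{(1)}) - \phi(x, -\delta_n^{(2)}) = \sum_{\alpha=1}^{3} \frac{(\delta_n^{(2)} - \delta_n^{(1)})^{\alpha}}{\alpha!} \frac{\partial^{\alpha} \phi}{\partial \theta^{\alpha}}(x, -\delta_n^{(2)}) + R_1(x),$$

$$\phi(x, c\delta_n^{(1)}) - \phi(x, c\delta_n^{(2)}) = \sum_{\alpha=1}^{3} \frac{c^{\alpha} (\delta_n^{(1)} - \delta_n^{(2)})^{\alpha}}{\alpha!} \frac{\partial^{\alpha} \phi}{\partial \theta^{\alpha}}(x, c\delta_n^{(2)}) + R_2(x)$$

$$= \sum_{\alpha=1}^{3} \frac{c^{\alpha} (\delta_n^{(1)} - \delta_n^{(2)})^{\alpha}}{\alpha!} \left( \sum_{\tau=0}^{3-\alpha} \frac{(c+1)^{\tau} (\delta_n^{(2)})^{\tau}}{\tau!} \frac{\partial^{\alpha+\tau} \phi}{\partial \theta^{\alpha+\tau}}(x, -\delta_n^{(2)}) + R_{2,\alpha}(x) \right) + R_2(x),$$

where $R_1(x)$, $R_2(x)$ are respectively the Taylor remainders up to the third order from performing
Taylor expansion around $-\delta_n^{(2)}$ and $c\delta_n^{(2)}$, while $R_{2,\alpha}$ are Taylor remainders up to the order $3 - \alpha$
from performing Taylor expansion around $-\delta_n^{(2)}$ in $\frac{\partial^{\alpha} \phi}{\partial \theta^{\alpha}}(x, c\delta_n^{(2)})$ for $1 \le \alpha \le 3$. Here, the Taylor
remainders $R_1(x)$ and $R_2(x)$ satisfy

$$\max\{ \|R_1(x)\|_{\infty}, \|R_2(x)\|_{\infty} \} = O\left( |\delta_n^{(1)} - \delta_n^{(2)}|^{3+\gamma} \right), \tag{G.2}$$

where $\gamma > 0$ is some positive constant. It implies that $R_1(x) / |\delta_n^{(1)} - \delta_n^{(2)}|^3 \to 0$ and $R_2(x) / |\delta_n^{(1)} - \delta_n^{(2)}|^3 \to 0$
for all $x \in \mathbb{R}$. Similarly, $\|R_{2,\alpha}(x)\|_{\infty} = O(|\delta_n^{(2)}|^{3-\alpha+\gamma})$ for $1 \le \alpha \le 3$. As $\delta_n^{(1)} / \delta_n^{(2)} \not\to 1$,
we have $|\delta_n^{(2)}| / |\delta_n^{(1)} - \delta_n^{(2)}| \not\to \infty$. Therefore, we have $|\delta_n^{(2)}|^{3-\alpha+\gamma} / |\delta_n^{(1)} - \delta_n^{(2)}|^{3-\alpha} \to 0$ as $n \to \infty$,
which eventually leads to

$$|\delta_n^{(1)} - \delta_n^{(2)}|^{\alpha} \, \|R_{2,\alpha}(x)\|_{\infty} / |\delta_n^{(1)} - \delta_n^{(2)}|^3 \to 0 \tag{G.3}$$

for all $1 \le \alpha \le 3$. Governed by the previous results, the following representation holds

$$\frac{g(x, \delta_n^{(1)}) - g(x, \delta_n^{(2)})}{|\delta_n^{(1)} - \delta_n^{(2)}|^3} = \frac{\pi \left( \sum_{\alpha=1}^{3} \frac{(\delta_n^{(2)} - \delta_n^{(1)})^{\alpha}}{\alpha!} \frac{\partial^{\alpha} \phi}{\partial \theta^{\alpha}}(x, -\delta_n^{(2)}) + R_1(x) \right)}{|\delta_n^{(1)} - \delta_n^{(2)}|^3}$$

$$+ \frac{(1 - \pi) \left( \sum_{\alpha=1}^{3} \frac{c^{\alpha} (\delta_n^{(1)} - \delta_n^{(2)})^{\alpha}}{\alpha!} \left( \sum_{\tau=0}^{3-\alpha} \frac{(c+1)^{\tau} (\delta_n^{(2)})^{\tau}}{\tau!} \frac{\partial^{\alpha+\tau} \phi}{\partial \theta^{\alpha+\tau}}(x, -\delta_n^{(2)}) + R_{2,\alpha}(x) \right) + R_2(x) \right)}{|\delta_n^{(1)} - \delta_n^{(2)}|^3}$$

$$= \frac{\sum_{\alpha=1}^{3} A_{n,\alpha} \frac{\partial^{\alpha} \phi}{\partial \theta^{\alpha}}(x, -\delta_n^{(2)}) + R(x)}{|\delta_n^{(1)} - \delta_n^{(2)}|^3}, \tag{G.4}$$

where $R(x) = \pi R_1(x) + (1 - \pi) \sum_{\alpha=1}^{3} \frac{c^{\alpha} (\delta_n^{(1)} - \delta_n^{(2)})^{\alpha}}{\alpha!} R_{2,\alpha}(x) + (1 - \pi) R_2(x)$ for all $x \in \mathbb{R}$. Invoking the
bounds on the Taylor remainders $R_1(x)$, $R_2(x)$, and $R_{2,\alpha}(x)$ in (G.2), (G.3), we have $\|R(x)\|_{\infty} / |\delta_n^{(1)} - \delta_n^{(2)}|^3 \to 0$
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as $n \to \infty$.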

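Step 2 - Non-vanishing coefficients Assume that the coefficients $A_{n,\alpha} / |\delta_n^{(1)} - \delta_n^{(2)}|^3 \to 0$ as
$n \to \infty$ for all $1 \le \alpha \le 3$. From the formulations of $A_{n,\alpha}$ in (G.4), we can quickly compute that
$A_{n,1} = 0$ while

$$A_{n,2} = \frac{c}{2} (\delta_n^{(1)} - \delta_n^{(2)}) (\delta_n^{(1)} + \delta_n^{(2)}),$$

$$A_{n,3} = \frac{\pi (\delta_n^{(2)} - \delta_n^{(1)})^3}{3!} + (1 - \pi) c (c+1)^2 (\delta_n^{(2)})^2 \frac{(\delta_n^{(1)} - \delta_n^{(2)})}{2} + (1 - \pi) (c+1) c^2 \delta_n^{(2)} \frac{(\delta_n^{(1)} - \delta_n^{(2)})^2}{2} + (1 - \pi) c^3 \frac{(\delta_n^{(1)} - \delta_n^{(2)})^3}{3!}.$$

As $A_{n,2} / |\delta_n^{(1)} - \delta_n^{(2)}|^3 \to 0$, it implies that $(\delta_n^{(1)} + \delta_n^{(2)}) / |\delta_n^{(1)} - \delta_n^{(2)}|^2 \to 0$, which leads to $\delta_n^{(1)} / \delta_n^{(2)} \to -1$
as $n \to \infty$. Plugging this limit into $A_{n,3} / |\delta_n^{(1)} - \delta_n^{(2)}|^3 \to 0$ yields the following equation

$$\frac{8\pi}{3!} - (1 - \pi) c (c+1)^2 + 2 (1 - \pi) c^2 (c+1) - \frac{8 (1 - \pi) c^3}{3!} = 0,$$

which has the unique solution $\pi = 1/2$, a contradiction to the assumption of the asymmetric setting,
i.e., $\pi \in (0, 1/2)$. Therefore, not all the coefficients $A_{n,\alpha} / |\delta_n^{(1)} - \delta_n^{(2)}|^3 \to 0$ when $n \to \infty$ for
$1 \le \alpha \le 3$.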
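The claim that the limiting equation forces $\pi = 1/2$ can be checked directly. The sketch below evaluates $8\pi/3! - (1-\pi)c(c+1)^2 + 2(1-\pi)c^2(c+1) - 8(1-\pi)c^3/3!$ with $c = \pi/(1-\pi)$; using $(1-\pi)c = \pi$, this expression simplifies to $\pi(1 - c^2)/3$, which vanishes on $(0, 1/2]$ only at $c = 1$. The function name is illustrative.

```python
def step2_equation(pi):
    """Left-hand side of the limiting equation for the third-order coefficient."""
    c = pi / (1.0 - pi)
    return (
        8.0 * pi / 6.0
        - (1.0 - pi) * c * (c + 1.0) ** 2
        + 2.0 * (1.0 - pi) * c ** 2 * (c + 1.0)
        - 8.0 * (1.0 - pi) * c ** 3 / 6.0
    )


def simplified_form(pi):
    """Closed form pi * (1 - c^2) / 3 obtained by substituting (1 - pi) * c = pi."""
    c = pi / (1.0 - pi)
    return pi * (1.0 - c ** 2) / 3.0
```

Evaluating `step2_equation` on a grid of $\pi \in (0, 1/2)$ gives strictly positive values, so in the asymmetric regime the third-order coefficient cannot vanish.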


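Step 3 - Fatou's argument Denote $m_n = |\delta_n^{(1)} - \delta_n^{(2)}|^3 / \max_{1 \le \alpha \le 3} |A_{n,\alpha}|$. Since not all the coefficients
$A_{n,\alpha} / |\delta_n^{(1)} - \delta_n^{(2)}|^3 \to 0$ for $1 \le \alpha \le 3$, we have $m_n \not\to \infty$. Therefore, we obtain that

$$m_n \frac{g(x, \delta_n^{(1)}) - g(x, \delta_n^{(2)})}{|\delta_n^{(1)} - \delta_n^{(2)}|^3} = m_n \frac{\sum_{\alpha=1}^{3} A_{n,\alpha} \frac{\partial^{\alpha} \phi}{\partial \theta^{\alpha}}(x, -\delta_n^{(2)}) + R(x)}{|\delta_n^{(1)} - \delta_n^{(2)}|^3} \to \sum_{\alpha=1}^{3} \beta_{\alpha} \frac{\partial^{\alpha} \phi}{\partial \theta^{\alpha}}(x, 0)$$

for all $x$, where $A_{n,\alpha} / \max_{1 \le \alpha \le 3} |A_{n,\alpha}| \to \beta_{\alpha}$ for $1 \le \alpha \le 3$ such that at least one of the $\beta_{\alpha}$ has absolute
value 1. Invoking Fatou's lemma, the following holds

$$0 = \lim_{n \to \infty} \frac{m_n V\left( g(x, \delta_n^{(1)}), g(x, \delta_n^{(2)}) \right)}{|\delta_n^{(1)} - \delta_n^{(2)}|^3} \gtrsim \int \liminf_{n \to \infty} m_n \left| \frac{g(x, \delta_n^{(1)}) - g(x, \delta_n^{(2)})}{|\delta_n^{(1)} - \delta_n^{(2)}|^3} \right| dx = \int \left| \sum_{\alpha=1}^{3} \beta_{\alpha} \frac{\partial^{\alpha} \phi}{\partial \theta^{\alpha}}(x, 0) \right| dx.$$

The above inequality leads to $\sum_{\alpha=1}^{3} \beta_{\alpha} \frac{\partial^{\alpha} \phi}{\partial \theta^{\alpha}}(x, 0) = 0$ for almost all $x$. Nevertheless, due to
the strong identifiability of the location Gaussian distribution (Chen, 1995), the above equation
implies that $\beta_{\alpha} = 0$ for all $1 \le \alpha \le 3$, which is a contradiction. Therefore, Case a.1 cannot hold.

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Case a.2: $\delta_n^{(1)} / \delta_n^{(2)} \to 1$ as $n \to \infty$. It implies that $|\delta_n^{(2)}| / |\delta_n^{(1)} - \delta_n^{(2)}| \to \infty$ as $n \to \infty$. As
$V\left( g(x, \delta_n^{(1)}), g(x, \delta_n^{(2)}) \right) / |\delta_n^{(1)} - \delta_n^{(2)}|^3 \to 0$, it implies that

$$V\left( g(x, \delta_n^{(1)}), g(x, \delta_n^{(2)}) \right) / |\delta_n^{(1)} - \delta_n^{(2)}|^2 \to 0$$

as $n \to \infty$. Similar to the Taylor expansion argument in Step 1 in Case a.1, by means
of Taylor expansion up to the second order, we obtain that

$$\frac{g(x, \delta_n^{(1)}) - g(x, \delta_n^{(2)})}{|\delta_n^{(1)} - \delta_n^{(2)}|^2} = \frac{\pi \left( \phi(x, -\delta_n^{(1)}) - \phi(x, -\delta_n^{(2)}) \right) + (1 - \pi) \left( \phi(x, c\delta_n^{(1)}) - \phi(x, c\delta_n^{(2)}) \right)}{|\delta_n^{(1)} - \delta_n^{(2)}|^2}$$

$$= \frac{\pi \left( \sum_{\alpha=1}^{2} \frac{(\delta_n^{(2)} - \delta_n^{(1)})^{\alpha}}{\alpha!} \frac{\partial^{\alpha} \phi}{\partial \theta^{\alpha}}(x, -\delta_n^{(2)}) + R_1'(x) \right)}{|\delta_n^{(1)} - \delta_n^{(2)}|^2}$$

$$+ \frac{(1 - \pi) \left( \sum_{\alpha=1}^{2} \frac{c^{\alpha} (\delta_n^{(1)} - \delta_n^{(2)})^{\alpha}}{\alpha!} \left( \sum_{\tau=0}^{2-\alpha} \frac{(c+1)^{\tau} (\delta_n^{(2)})^{\tau}}{\tau!} \frac{\partial^{\alpha+\tau} \phi}{\partial \theta^{\alpha+\tau}}(x, -\delta_n^{(2)}) + R_{2,\alpha}'(x) \right) + R_2'(x) \right)}{|\delta_n^{(1)} - \delta_n^{(2)}|^2}$$

$$= \frac{\sum_{\alpha=1}^{2} A_{n,\alpha} \frac{\partial^{\alpha} \phi}{\partial \theta^{\alpha}}(x, -\delta_n^{(2)}) + R'(x)}{|\delta_n^{(1)} - \delta_n^{(2)}|^2} \to 0,$$

where $\|R'(x)\|_{\infty} = O\left( |\delta_n^{(2)}|^{1+\gamma} |\delta_n^{(1)} - \delta_n^{(2)}| \right)$ for some $\gamma > 0$. By means of the calculations with
$A_{n,\alpha}$ in Case a.1, we have

$$\frac{\|R'(x)\|_{\infty}}{|A_{n,2}|} = O\left( \frac{|\delta_n^{(2)}|^{1+\gamma} |\delta_n^{(1)} - \delta_n^{(2)}|}{|\delta_n^{(1)} - \delta_n^{(2)}| \, |\delta_n^{(1)} + \delta_n^{(2)}|} \right) \to 0.$$

Now, if $A_{n,\alpha} / |\delta_n^{(1)} - \delta_n^{(2)}|^2 \to 0$ for all $1 \le \alpha \le 2$, we have $|\delta_n^{(1)} + \delta_n^{(2)}| / |\delta_n^{(1)} - \delta_n^{(2)}| \to 0$, which implies
that $\delta_n^{(1)} / \delta_n^{(2)} \to -1$, a contradiction to the assumption of Case a.2. According to the argument in
Step 3 in Case a.1, by denoting $m_n' = |\delta_n^{(1)} - \delta_n^{(2)}|^2 / \max_{1 \le \alpha \le 2} |A_{n,\alpha}|$, we have $m_n' \not\to \infty$. Therefore, we
have

$$m_n' \frac{g(x, \delta_n^{(1)}) - g(x, \delta_n^{(2)})}{|\delta_n^{(1)} - \delta_n^{(2)}|^2} \to \sum_{\alpha=1}^{2} \tau_{\alpha} \frac{\partial^{\alpha} \phi}{\partial \theta^{\alpha}}(x, 0)$$

for all $x$, for some coefficients $\tau_{\alpha}$ such that at least one of them has absolute value 1. By
virtue of Fatou's lemma as in Step 3 in Case a.1, applied to $\lim_{n \to \infty} V\left( g(x, \delta_n^{(1)}), g(x, \delta_n^{(2)}) \right) / |\delta_n^{(1)} - \delta_n^{(2)}|^2$, we
achieve that $\sum_{\alpha=1}^{2} \tau_{\alpha} \frac{\partial^{\alpha} \phi}{\partial \theta^{\alpha}}(x, 0) = 0$ for almost all $x$. However, the strong identifiability of the location
Gaussian distribution implies that $\tau_{\alpha} = 0$ for all $1 \le \alpha \le 2$, which is a contradiction. Therefore,
Case a.2 cannot happen.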
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Combining the results from Case a.1 and Case a.2, we achieve the conclusion of (G.1). As a 
consequence, the conclusion of part (a) of Lemma G.1 follows. 

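(b) Similar to the proof strategy of part (a), to obtain the conclusion of this result, it is sufficient
to demonstrate that

$$\inf_{\delta^{(1)}, \delta^{(2)} \in \Theta} V\left( g(x, \delta^{(1)}), g(x, \delta^{(2)}) \right) \Big/ \left| |\delta^{(1)}| - |\delta^{(2)}| \right|^2 > 0. \tag{G.5}$$

Assume that the conclusion of (G.5) does not hold. It implies that we can find two sequences
$\{\delta_n^{(1)}\}$ and $\{\delta_n^{(2)}\}$ such that

$$V\left( g(x, \delta_n^{(1)}), g(x, \delta_n^{(2)}) \right) \Big/ \left| |\delta_n^{(1)}| - |\delta_n^{(2)}| \right|^2 \to 0$$

as $n \to \infty$. Similar to the proof argument of part (a), we only consider the possibility that $\delta_n^{(1)} \to 0$
and $\delta_n^{(2)} \to 0$ as $n \to \infty$. Now, we have three different settings of $\delta_n^{(1)}$ and $\delta_n^{(2)}$.


Case b.1: $\delta_n^{(1)} / \delta_n^{(2)} \not\to 1$ as $n \to \infty$ and $\delta_n^{(1)} \delta_n^{(2)} > 0$ for all $n$. (Here, the limit and the inequality
can be thought of as those of some subsequence of $\delta_n^{(1)}$ and $\delta_n^{(2)}$. However, we replace this subsequence
by the whole sequence of $\delta_n^{(1)}$ and $\delta_n^{(2)}$ for simplicity of presentation.) Under that setting,
we have

$$\frac{V\left( g(x, \delta_n^{(1)}), g(x, \delta_n^{(2)}) \right)}{\left| |\delta_n^{(1)}| - |\delta_n^{(2)}| \right|^2} = \frac{V\left( g(x, \delta_n^{(1)}), g(x, \delta_n^{(2)}) \right)}{|\delta_n^{(1)} - \delta_n^{(2)}|^2} \to 0.$$

To ease the understanding, we divide our argument for Case b.1 into two separate steps.


Step 1 - Taylor expansion By means of Taylor expansion up to the second order as in
Case a.2 in the proof of part (a), we obtain that

$$\frac{g(x, \delta_n^{(1)}) - g(x, \delta_n^{(2)})}{|\delta_n^{(1)} - \delta_n^{(2)}|^2} = \frac{\sum_{\alpha=1}^{2} A_{n,\alpha} \frac{\partial^{\alpha} \phi}{\partial \theta^{\alpha}}(x, -\delta_n^{(2)}) + R'(x)}{|\delta_n^{(1)} - \delta_n^{(2)}|^2} \to 0,$$

where $R'(x)$ is a combination of Taylor remainders such that

$$\|R'(x)\|_{\infty} = O\left( |\delta_n^{(2)}|^{1+\gamma} |\delta_n^{(1)} - \delta_n^{(2)}| \right)$$

for some positive constant $\gamma$, and $A_{n,\alpha}$ are defined as in Case a.2 with $\pi = 1/2$. Since
$\delta_n^{(1)} / \delta_n^{(2)} \not\to 1$, we have $|\delta_n^{(2)}| / |\delta_n^{(1)} - \delta_n^{(2)}| \not\to \infty$. Therefore, it leads to

$$\|R'(x)\|_{\infty} / |\delta_n^{(1)} - \delta_n^{(2)}|^2 \to 0$$

as $n \to \infty$.


Step 2 - Non-vanishing coefficients and Fatou's argument Assume that $A_{n,\alpha} / |\delta_n^{(1)} - \delta_n^{(2)}|^2 \to 0$
for all $1 \le \alpha \le 2$. From the formulation of $A_{n,2}$, we have

$$(\delta_n^{(1)} + \delta_n^{(2)}) / |\delta_n^{(1)} - \delta_n^{(2)}| \to 0.$$

It implies that $\delta_n^{(1)} / \delta_n^{(2)} \to -1$ as $n \to \infty$, which is a contradiction to the condition that $\delta_n^{(1)} \delta_n^{(2)} > 0$.
Therefore, not all of the coefficients $A_{n,\alpha} / |\delta_n^{(1)} - \delta_n^{(2)}|^2$ go to 0. From here, by means of the Fatou's
argument in Step 3 of Case a.1, we achieve the conclusion that Case b.1 cannot hold.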


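Case b.2: $\delta_n^{(1)} / \delta_n^{(2)} \not\to 1$ and $\delta_n^{(1)} \delta_n^{(2)} < 0$ for all $n$. Under that setting, we have

$$\frac{V\left( g(x, \delta_n^{(1)}), g(x, \delta_n^{(2)}) \right)}{\left| |\delta_n^{(1)}| - |\delta_n^{(2)}| \right|^2} = \frac{V\left( g(x, \delta_n^{(1)}), g(x, \delta_n^{(2)}) \right)}{|\delta_n^{(1)} + \delta_n^{(2)}|^2} \to 0.$$

We also divide the argument of Case b.2 into two key steps.


Step 1 - Taylor expansion By means of Taylor expansion up to the second order, we obtain

$$\frac{g(x, \delta_n^{(1)}) - g(x, \delta_n^{(2)})}{|\delta_n^{(1)} + \delta_n^{(2)}|^2} = \frac{\frac{1}{2} \left( \phi(x, -\delta_n^{(1)}) - \phi(x, \delta_n^{(2)}) \right) + \frac{1}{2} \left( \phi(x, \delta_n^{(1)}) - \phi(x, -\delta_n^{(2)}) \right)}{|\delta_n^{(1)} + \delta_n^{(2)}|^2}$$

$$= \frac{\frac{1}{2} \left( \sum_{\alpha=1}^{2} \frac{(-\delta_n^{(1)} - \delta_n^{(2)})^{\alpha}}{\alpha!} \frac{\partial^{\alpha} \phi}{\partial \theta^{\alpha}}(x, \delta_n^{(2)}) + R_1''(x) \right)}{|\delta_n^{(1)} + \delta_n^{(2)}|^2}$$

$$+ \frac{\frac{1}{2} \left( \sum_{\alpha=1}^{2} \frac{(\delta_n^{(1)} + \delta_n^{(2)})^{\alpha}}{\alpha!} \left( \sum_{\tau=0}^{2-\alpha} \frac{2^{\tau} (-\delta_n^{(2)})^{\tau}}{\tau!} \frac{\partial^{\alpha+\tau} \phi}{\partial \theta^{\alpha+\tau}}(x, \delta_n^{(2)}) + R_{2,\alpha}''(x) \right) + R_2''(x) \right)}{|\delta_n^{(1)} + \delta_n^{(2)}|^2}$$

$$= \frac{\sum_{\alpha=1}^{2} B_{n,\alpha} \frac{\partial^{\alpha} \phi}{\partial \theta^{\alpha}}(x, \delta_n^{(2)}) + R''(x)}{|\delta_n^{(1)} + \delta_n^{(2)}|^2} \to 0,$$

where $R''(x)$ is the combination of Taylor remainders such that

$$\|R''(x)\|_{\infty} = O\left( |\delta_n^{(2)}|^{1+\gamma} |\delta_n^{(1)} + \delta_n^{(2)}| \right),$$

which implies that $\|R''(x)\|_{\infty} / |\delta_n^{(1)} + \delta_n^{(2)}|^2 \to 0$ as $n \to \infty$.
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Step 2 - Non-vanishing coefficients and Fatou's argument Assume that $B_{n,\alpha} / |\delta_n^{(1)} + \delta_n^{(2)}|^2 \to 0$
for all $1 \le \alpha \le 2$. Direct computation with $B_{n,2}$ implies that

$$(\delta_n^{(1)} - \delta_n^{(2)}) / |\delta_n^{(1)} + \delta_n^{(2)}| \to 0$$

as $n \to \infty$. It leads to $\delta_n^{(1)} / \delta_n^{(2)} \to 1$, which is a contradiction to the assumption that $\delta_n^{(1)} \delta_n^{(2)} < 0$.
From here, by the Fatou's argument in Step 3 of Case a.1, we also obtain the conclusion that Case b.2
does not hold.


Case b.3: $\delta_n^{(1)} / \delta_n^{(2)} \to 1$ as $n \to \infty$. This implies that $\delta_n^{(1)} \delta_n^{(2)} > 0$ when $n$ is sufficiently large.
From here, the proof argument of this case is similar to that of Case a.2 in part (a), which also
yields the contradiction.

As a consequence, we achieve the conclusion of part (b) of the lemma.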


G.1.2 Proof for lower bounds 
(a) Based on the proof technique of Theorem 3.2 in Heinrich and Kahn (2018), to achieve the
conclusion with the lower bound of part (a) of the theorem, it is sufficient to demonstrate that

$$\inf_{\delta^{(1)}, \delta^{(2)} \in \Theta} h\left( g(x, \delta^{(1)}), g(x, \delta^{(2)}) \right) \Big/ \left| \delta^{(1)} - \delta^{(2)} \right|^{r} = 0 \tag{G.6}$$

for any $1 \le r < 3$. We divide the proof argument for the above result into several key steps.


Step 1 - Constructing sequences We construct two sequences $\{\delta_n^{(1)}\}$ and $\{\delta_n^{(2)}\}$ such
that $\delta_n^{(1)} = -\delta_n^{(2)}$ for all $n \ge 1$ and $\delta_n^{(1)} \to 0$ as $n \to \infty$. For any fixed $r < 3$, by means of Taylor
expansion up to the second order as in Step 1 of Case a.1 in part (a) of Theorem 2.1 (cf.
Equation (G.4)), the following holds

$$g(x, \delta_n^{(1)}) - g(x, \delta_n^{(2)}) = \sum_{\alpha=1}^{2} A_{n,\alpha} \frac{\partial^{\alpha} \phi}{\partial \theta^{\alpha}}(x, -\delta_n^{(2)}) + R(x),$$

where $R(x)$ is a combination of Taylor remainders whose detailed formulation is postponed to later
discussion. Additionally, the formulations of $A_{n,\alpha}$ satisfy $A_{n,1} = 0$ and

$$A_{n,2} = \frac{c}{2} (\delta_n^{(1)} - \delta_n^{(2)}) (\delta_n^{(1)} + \delta_n^{(2)}) = 0.$$


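Step 2 - Hellinger bound and Taylor remainders Equipped with the above results, we have

$$\frac{h^2\left( g(x, \delta_n^{(1)}), g(x, \delta_n^{(2)}) \right)}{|\delta_n^{(1)} - \delta_n^{(2)}|^{2r}} = \int \frac{\left( g(x, \delta_n^{(1)}) - g(x, \delta_n^{(2)}) \right)^2}{2^{2r+1} |\delta_n^{(1)}|^{2r} \left( \sqrt{g(x, \delta_n^{(1)})} + \sqrt{g(x, \delta_n^{(2)})} \right)^2} dx$$

$$= \int \frac{R^2(x)}{2^{2r+1} |\delta_n^{(1)}|^{2r} \left( \sqrt{g(x, \delta_n^{(1)})} + \sqrt{g(x, \delta_n^{(2)})} \right)^2} dx.$$

To validate that the above term goes to 0, we need to investigate the concrete formulation of
$R(x)$. In particular, the formulation of $R(x)$ is

$$R(x) = \pi R_1(x) + (1 - \pi) \sum_{\alpha=1}^{2} \frac{c^{\alpha} (\delta_n^{(1)} - \delta_n^{(2)})^{\alpha}}{\alpha!} R_{2,\alpha}(x) + (1 - \pi) R_2(x),$$

where the formulations of the Taylor remainders $R_1(x)$, $R_{2,\alpha}(x)$, and $R_2(x)$ are as follows

$$R_1(x) = \frac{(\delta_n^{(2)} - \delta_n^{(1)})^3}{2!} \int_0^1 (1 - t)^2 \frac{\partial^3 \phi}{\partial \theta^3}\left( x, -\delta_n^{(2)} + t (\delta_n^{(2)} - \delta_n^{(1)}) \right) dt,$$

$$R_2(x) = \frac{c^3 (\delta_n^{(1)} - \delta_n^{(2)})^3}{2!} \int_0^1 (1 - t)^2 \frac{\partial^3 \phi}{\partial \theta^3}\left( x, c\delta_n^{(2)} + t (c\delta_n^{(1)} - c\delta_n^{(2)}) \right) dt,$$

$$R_{2,\alpha}(x) = \frac{((c+1) \delta_n^{(2)})^{3-\alpha}}{(2 - \alpha)!} \int_0^1 (1 - t)^{2-\alpha} \frac{\partial^3 \phi}{\partial \theta^3}\left( x, -\delta_n^{(2)} + t (c+1) \delta_n^{(2)} \right) dt,$$

for any $1 \le \alpha \le 2$.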


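Step 3 - Taylor remainders control Now, Hölder's inequality leads to

$$R_1^2(x) \le \frac{(\delta_n^{(2)} - \delta_n^{(1)})^6}{(2!)^2} \int_0^1 (1 - t)^4 \left( \frac{\partial^3 \phi}{\partial \theta^3}\left( x, -\delta_n^{(2)} + t (\delta_n^{(2)} - \delta_n^{(1)}) \right) \right)^2 dt.$$

Due to the formulation of the location Gaussian kernel with variance 1, we can check that

$$\sup_{t \in [0,1]} \int \frac{\left( \frac{\partial^3 \phi}{\partial \theta^3}\left( x, -\delta_n^{(2)} + t (\delta_n^{(2)} - \delta_n^{(1)}) \right) \right)^2}{\phi(x, -\delta_n^{(2)})} dx < \infty.$$

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Equipped with the above results, the following holds

$$\int \frac{R_1^2(x)}{2^{2r+1} |\delta_n^{(1)}|^{2r} \left( \sqrt{g(x, \delta_n^{(1)})} + \sqrt{g(x, \delta_n^{(2)})} \right)^2} dx \le \int \frac{R_1^2(x)}{2^{2r+1} |\delta_n^{(1)}|^{2r} \pi \phi(x, -\delta_n^{(2)})} dx \lesssim |\delta_n^{(1)}|^{6-2r} \to 0 \tag{G.7}$$

as $n \to \infty$, where the first inequality is due to the inequality $\left( \sqrt{g(x, \delta_n^{(1)})} + \sqrt{g(x, \delta_n^{(2)})} \right)^2 \ge \pi \phi(x, -\delta_n^{(2)})$.
By means of a similar argument, we also obtain that

$$\int \frac{R_2^2(x)}{2^{2r+1} |\delta_n^{(1)}|^{2r} \left( \sqrt{g(x, \delta_n^{(1)})} + \sqrt{g(x, \delta_n^{(2)})} \right)^2} dx \le \int \frac{R_2^2(x)}{2^{2r+1} |\delta_n^{(1)}|^{2r} (1 - \pi) \phi(x, c\delta_n^{(2)})} dx \lesssim |\delta_n^{(1)}|^{6-2r} \to 0,$$

$$\int \frac{(\delta_n^{(1)} - \delta_n^{(2)})^{2\alpha} R_{2,\alpha}^2(x)}{2^{2r+1} |\delta_n^{(1)}|^{2r} \left( \sqrt{g(x, \delta_n^{(1)})} + \sqrt{g(x, \delta_n^{(2)})} \right)^2} dx \le \int \frac{(\delta_n^{(1)} - \delta_n^{(2)})^{2\alpha} R_{2,\alpha}^2(x)}{2^{2r+1} |\delta_n^{(1)}|^{2r} \pi \phi(x, -\delta_n^{(2)})} dx \lesssim |\delta_n^{(1)}|^{6-2r} \to 0. \tag{G.8}$$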


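Invoking the Cauchy-Schwarz inequality, the following inequality holds

$$R^2(x) \le 3 \left[ (\pi R_1(x))^2 + \left( (1 - \pi) \sum_{\alpha=1}^{2} \frac{c^{\alpha} (\delta_n^{(1)} - \delta_n^{(2)})^{\alpha}}{\alpha!} R_{2,\alpha}(x) \right)^2 + ((1 - \pi) R_2(x))^2 \right]. \tag{G.9}$$

Combining the results from (G.7), (G.8), and (G.9), we achieve that

$$\int \frac{R^2(x)}{2^{2r+1} |\delta_n^{(1)}|^{2r} \left( \sqrt{g(x, \delta_n^{(1)})} + \sqrt{g(x, \delta_n^{(2)})} \right)^2} dx \to 0.$$

As a consequence, we achieve the conclusion with the lower bound of part (a) of the theorem.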
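The construction $\delta_n^{(1)} = -\delta_n^{(2)}$ can be checked numerically. The sketch below, which is an illustration rather than part of the proof, computes the Hellinger distance between $g(x, \delta)$ and $g(x, -\delta)$ by the trapezoid rule, using the convention $h^2 = \frac{1}{2}\int(\sqrt{f} - \sqrt{g})^2$ (an assumption about normalization). In the asymmetric case the ratio $h/\delta^3$ is roughly constant for small $\delta$, so $h/|\delta^{(1)} - \delta^{(2)}|^r \to 0$ for every $r < 3$; in the symmetric case the two densities coincide.

```python
import math


def phi(x, loc):
    """Gaussian density with location loc and unit variance."""
    return math.exp(-0.5 * (x - loc) ** 2) / math.sqrt(2.0 * math.pi)


def g(x, delta, pi):
    """Mixture density pi*phi(x, -delta) + (1 - pi)*phi(x, c*delta) with c = pi / (1 - pi)."""
    c = pi / (1.0 - pi)
    return pi * phi(x, -delta) + (1.0 - pi) * phi(x, c * delta)


def hellinger(delta1, delta2, pi, lo=-10.0, hi=10.0, steps=4000):
    """Hellinger distance between g(., delta1) and g(., delta2) via the trapezoid rule."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * dx
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * (math.sqrt(g(x, delta1, pi)) - math.sqrt(g(x, delta2, pi))) ** 2
    return math.sqrt(0.5 * total * dx)


def cubic_ratio(delta, pi):
    """h(g(., delta), g(., -delta)) divided by delta^3."""
    return hellinger(delta, -delta, pi) / delta ** 3
```

Comparing `cubic_ratio(0.2, 0.3)` with `cubic_ratio(0.1, 0.3)` shows the cubic scaling, while `hellinger(0.2, -0.2, 0.5)` returns zero because the symmetric mixture cannot distinguish $\delta$ from $-\delta$.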
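(b) Similar to the proof argument of part (a), to achieve the conclusion of the lower bound of
part (b), it is sufficient to demonstrate that

$$\inf_{\delta^{(1)}, \delta^{(2)} \in \Theta} h\left( g(x, \delta^{(1)}), g(x, \delta^{(2)}) \right) \Big/ \left| |\delta^{(1)}| - |\delta^{(2)}| \right|^{r} = 0 \tag{G.10}$$

for any $1 \le r < 2$. In particular, we choose two sequences $\{\delta_n^{(1)}\}$ and $\{\delta_n^{(2)}\}$ such that $\delta_n^{(1)} = 2\delta_n^{(2)}$
for all $n \ge 1$ and $\delta_n^{(1)} \to 0$ as $n \to \infty$. For any $r < 2$, invoking Taylor expansion up to the first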
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order as that of Case b.1 in the proof of Theorem 2.1, we have 
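$$g(x, \delta_n^{(1)}) - g(x, \delta_n^{(2)}) = R(x),$$

where the formulation of $R(x)$ is

$$R(x) = \frac{1}{2} R_1(x) + \frac{1}{2} (\delta_n^{(1)} - \delta_n^{(2)}) R_{2,1}(x) + \frac{1}{2} R_2(x).$$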

Here, the detail formulations of Taylor remainders Ri(2), R21(x), and Ro(x) are 


avy? 
eee ca, [° 2b (2, a + (8 —3)) at, 


TT ae? 
2( a - 5, ) 1 
Ro(x) = Yh _ (5, +5,” —8,”)) ae, 
0 
(2) saa 52) 4 943) 
Ro (x) = 266 {Fel 2,5, +25) at. 


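With the choice that $\delta_n^{(1)} = 2\delta_n^{(2)} \to 0$ and the same argument as Step 3 in part (a), we can argue
that

$$\int \frac{R^2(x)}{\left| |\delta_n^{(1)}| - |\delta_n^{(2)}| \right|^{2r} \left( \sqrt{g(x, \delta_n^{(1)})} + \sqrt{g(x, \delta_n^{(2)})} \right)^2} dx \to 0$$

as $n \to \infty$. Therefore, for any $1 \le r < 2$, we achieve

$$h\left( g(x, \delta_n^{(1)}), g(x, \delta_n^{(2)}) \right) \Big/ \left| |\delta_n^{(1)}| - |\delta_n^{(2)}| \right|^{r} \to 0.$$

As a consequence, we achieve the conclusion of part (b) of the theorem.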


G.2 Proof of Theorem 2.2


For the sake of presentation, we denote $v := \sigma^2$ and $g(x,\delta,v) := \pi f(x,-\delta,v) + (1-\pi) f(x,c\delta,v)$ for all $\delta \in \Theta$, $v \in \Omega$, where $f(x,\theta,v)$ is the density of the location-scale Gaussian distribution with location $\theta$ and scale $v$. To simplify the argument, we only give the proof of the upper bounds of the theorem. The lower bounds can be argued in the same way as the lower bounds of Theorem 2.1 in Section G.1.2.

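(a) By the argument used for the upper bound of Theorem 2.1, in order to obtain the upper bound of part (a), it is sufficient to demonstrate that

$$\inf_{\substack{\delta^{(1)},\delta^{(2)}\in\Theta \\ v^{(1)},v^{(2)}\in\Omega}} \frac{V\big(g(x,\delta^{(1)},v^{(1)}),\, g(x,\delta^{(2)},v^{(2)})\big)}{|\delta^{(1)}-\delta^{(2)}|^{3} + |v^{(1)}-v^{(2)}|^{3/2}} > 0, \tag{G.11}$$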
where $\Theta = [-1,1]$ and $\Omega$ is a bounded set containing $\sigma^2$. Assume that the above inequality does not hold. Then we can find sequences $\{\delta_n^{(1)}\}$, $\{\delta_n^{(2)}\}$, $\{v_n^{(1)}\}$, and $\{v_n^{(2)}\}$ such that

$$\frac{V\big(g(x,\delta_n^{(1)},v_n^{(1)}),\, g(x,\delta_n^{(2)},v_n^{(2)})\big)}{|\delta_n^{(1)}-\delta_n^{(2)}|^{3} + |v_n^{(1)}-v_n^{(2)}|^{3/2}} \to 0$$

as $n \to \infty$. To simplify the presentation, we only consider the most challenging setting $\delta_n^{(1)} \to 0$, $\delta_n^{(2)} \to 0$, $v_n^{(1)} \to v_0$, $v_n^{(2)} \to v_0$ for some $v_0 \in \Omega$. Additionally, we denote

$$D_n = |\delta_n^{(1)}-\delta_n^{(2)}|^{3} + |v_n^{(1)}-v_n^{(2)}|^{3/2}.$$

Now, we consider the following settings with $\delta_n^{(1)}$ and $\delta_n^{(2)}$.


Case a.1: $\delta_n^{(1)}/\delta_n^{(2)} \not\to 1$ as $n \to \infty$. As in the proof of Theorem 2.1, we divide the argument for this case into two key steps.


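Step 1 - Taylor expansion. Under this setting, by means of a Taylor expansion up to the third order, we obtain that

$$
\begin{aligned}
&\frac{g(x,\delta_n^{(1)},v_n^{(1)}) - g(x,\delta_n^{(2)},v_n^{(2)})}{D_n} \\
&= \frac{\pi\big(f(x,-\delta_n^{(1)},v_n^{(1)}) - f(x,-\delta_n^{(2)},v_n^{(2)})\big) + (1-\pi)\big(f(x,c\delta_n^{(1)},v_n^{(1)}) - f(x,c\delta_n^{(2)},v_n^{(2)})\big)}{D_n} \\
&= \frac{\pi\Big(\sum_{1\le|\alpha|\le 3} \frac{(\delta_n^{(2)}-\delta_n^{(1)})^{\alpha_1}(v_n^{(1)}-v_n^{(2)})^{\alpha_2}}{\alpha_1!\,\alpha_2!} \frac{\partial^{|\alpha|} f}{\partial\theta^{\alpha_1}\partial v^{\alpha_2}}(x,-\delta_n^{(2)},v_n^{(2)}) + R_1(x)\Big)}{D_n} \\
&\quad + \frac{(1-\pi)\Big(\sum_{1\le|\alpha|\le 3} \frac{c^{\alpha_1}(\delta_n^{(1)}-\delta_n^{(2)})^{\alpha_1}(v_n^{(1)}-v_n^{(2)})^{\alpha_2}}{\alpha_1!\,\alpha_2!} \frac{\partial^{|\alpha|} f}{\partial\theta^{\alpha_1}\partial v^{\alpha_2}}(x,c\delta_n^{(2)},v_n^{(2)}) + R_2(x)\Big)}{D_n} \\
&= \frac{\pi\Big(\sum_{1\le|\alpha|\le 3} \frac{1}{2^{\alpha_2}} \frac{(\delta_n^{(2)}-\delta_n^{(1)})^{\alpha_1}(v_n^{(1)}-v_n^{(2)})^{\alpha_2}}{\alpha_1!\,\alpha_2!} \frac{\partial^{\alpha_1+2\alpha_2} f}{\partial\theta^{\alpha_1+2\alpha_2}}(x,-\delta_n^{(2)},v_n^{(2)}) + R_1(x)\Big)}{D_n} \\
&\quad + \frac{(1-\pi)\Big(\sum_{1\le|\alpha|\le 3} \frac{1}{2^{\alpha_2}} \frac{c^{\alpha_1}(\delta_n^{(1)}-\delta_n^{(2)})^{\alpha_1}(v_n^{(1)}-v_n^{(2)})^{\alpha_2}}{\alpha_1!\,\alpha_2!} \frac{\partial^{\alpha_1+2\alpha_2} f}{\partial\theta^{\alpha_1+2\alpha_2}}(x,c\delta_n^{(2)},v_n^{(2)}) + R_2(x)\Big)}{D_n},
\end{aligned}
\tag{G.12}
$$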
where the last equality is due to the PDE structure of the location-scale Gaussian distribution, which is given by

$$\frac{\partial^{2} f}{\partial\theta^{2}}(x,\theta,v) = 2\,\frac{\partial f}{\partial v}(x,\theta,v).$$
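As a quick numerical sanity check of this PDE identity, the following sketch compares central finite-difference estimates of both sides at a few points. It assumes that $f(x,\theta,v)$ is the $N(\theta, v)$ density with $v$ the variance; the function names, step sizes, and evaluation points are illustrative choices and not part of the original argument.

```python
import math


def gaussian_pdf(x, theta, v):
    """Density of N(theta, v) at x, where v is the variance (assumed convention)."""
    return math.exp(-(x - theta) ** 2 / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)


def d2_dtheta2(x, theta, v, h=1e-3):
    """Central finite-difference estimate of the second derivative in theta."""
    return (gaussian_pdf(x, theta + h, v)
            - 2.0 * gaussian_pdf(x, theta, v)
            + gaussian_pdf(x, theta - h, v)) / h ** 2


def d_dv(x, theta, v, h=1e-5):
    """Central finite-difference estimate of the first derivative in v."""
    return (gaussian_pdf(x, theta, v + h) - gaussian_pdf(x, theta, v - h)) / (2.0 * h)


def pde_gap(x, theta, v):
    """Absolute difference between d^2 f / d theta^2 and 2 * d f / d v."""
    return abs(d2_dtheta2(x, theta, v) - 2.0 * d_dv(x, theta, v))


# Largest discrepancy over a small, arbitrary grid of evaluation points.
max_gap = max(
    pde_gap(x, theta, v)
    for x in (-1.5, 0.0, 0.7, 2.0)
    for theta in (-0.5, 0.0, 0.3)
    for v in (0.5, 1.0, 2.0)
)
print(max_gap)
```

The gap is only finite-difference error, which is what allows every $v$-derivative in (G.12) to be traded for two $\theta$-derivatives at the cost of a factor $1/2$.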


Additionally, $R_1(x)$ and $R_2(x)$ are Taylor remainders that satisfy the following inequality

$$\max\{\|R_1(x)\|_\infty, \|R_2(x)\|_\infty\} = O\big(|\delta_n^{(1)}-\delta_n^{(2)}|^{3+\gamma} + |v_n^{(1)}-v_n^{(2)}|^{3+\gamma}\big)$$

for some $\gamma > 0$. It implies that $R_1(x)/D_n \to 0$ and $R_2(x)/D_n \to 0$ for all $x$ as $n \to \infty$. Now, by means of a Taylor expansion up to the third order, we further have


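$$\frac{\partial^{\alpha_1+2\alpha_2} f}{\partial\theta^{\alpha_1+2\alpha_2}}(x,c\delta_n^{(2)},v_n^{(2)}) = \sum_{\tau=0}^{3-|\alpha|} \frac{(c+1)^{\tau}(\delta_n^{(2)})^{\tau}}{\tau!} \frac{\partial^{\alpha_1+2\alpha_2+\tau} f}{\partial\theta^{\alpha_1+2\alpha_2+\tau}}(x,-\delta_n^{(2)},v_n^{(2)}) + R_{2,\alpha}(x) \tag{G.13}$$

for each $\alpha = (\alpha_1,\alpha_2)$ such that $1 \le |\alpha| \le 3$. Here, $R_{2,\alpha}(x)$ is a Taylor remainder that satisfies $\|R_{2,\alpha}(x)\|_\infty = O\big(|\delta_n^{(2)}|^{3-|\alpha|+\gamma}\big)$ for all $\alpha$. By plugging equation (G.13) into (G.12), the following holds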


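$$
\begin{aligned}
&\frac{g(x,\delta_n^{(1)},v_n^{(1)}) - g(x,\delta_n^{(2)},v_n^{(2)})}{D_n} \\
&= \frac{\pi\sum_{1\le|\alpha|\le 3} \frac{1}{2^{\alpha_2}} \frac{(\delta_n^{(2)}-\delta_n^{(1)})^{\alpha_1}(v_n^{(1)}-v_n^{(2)})^{\alpha_2}}{\alpha_1!\,\alpha_2!} \frac{\partial^{\alpha_1+2\alpha_2} f}{\partial\theta^{\alpha_1+2\alpha_2}}(x,-\delta_n^{(2)},v_n^{(2)})}{D_n} \\
&\quad + \frac{(1-\pi)\sum_{1\le|\alpha|\le 3}\sum_{\tau=0}^{3-|\alpha|} \frac{1}{2^{\alpha_2}} \frac{c^{\alpha_1}(c+1)^{\tau}(\delta_n^{(2)})^{\tau}(\delta_n^{(1)}-\delta_n^{(2)})^{\alpha_1}(v_n^{(1)}-v_n^{(2)})^{\alpha_2}}{\tau!\,\alpha_1!\,\alpha_2!} \frac{\partial^{\alpha_1+2\alpha_2+\tau} f}{\partial\theta^{\alpha_1+2\alpha_2+\tau}}(x,-\delta_n^{(2)},v_n^{(2)})}{D_n} \\
&\quad + \frac{\pi R_1(x) + (1-\pi)\Big(R_2(x) + \sum_{1\le|\alpha|\le 3} \frac{1}{2^{\alpha_2}} \frac{c^{\alpha_1}(\delta_n^{(1)}-\delta_n^{(2)})^{\alpha_1}(v_n^{(1)}-v_n^{(2)})^{\alpha_2}}{\alpha_1!\,\alpha_2!} R_{2,\alpha}(x)\Big)}{D_n} \\
&= \frac{\sum_{l=1}^{6} A_{n,l}\, \frac{\partial^{l} f}{\partial\theta^{l}}(x,-\delta_n^{(2)},v_n^{(2)}) + R(x)}{D_n},
\end{aligned}
$$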
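where the detailed formulations of $A_{n,l}$ and $R(x)$ are as follows

$$
\begin{aligned}
A_{n,l} &= \pi\sum_{\alpha_1,\alpha_2} \frac{1}{2^{\alpha_2}} \frac{(\delta_n^{(2)}-\delta_n^{(1)})^{\alpha_1}(v_n^{(1)}-v_n^{(2)})^{\alpha_2}}{\alpha_1!\,\alpha_2!} \\
&\quad + (1-\pi)\sum_{\alpha_1,\alpha_2,\tau} \frac{1}{2^{\alpha_2}} \frac{c^{\alpha_1}(c+1)^{\tau}(\delta_n^{(2)})^{\tau}(\delta_n^{(1)}-\delta_n^{(2)})^{\alpha_1}(v_n^{(1)}-v_n^{(2)})^{\alpha_2}}{\tau!\,\alpha_1!\,\alpha_2!}, \\
R(x) &= \pi R_1(x) + (1-\pi)R_2(x) + (1-\pi)\sum_{1\le|\alpha|\le 3} \frac{1}{2^{\alpha_2}} \frac{c^{\alpha_1}(\delta_n^{(1)}-\delta_n^{(2)})^{\alpha_1}(v_n^{(1)}-v_n^{(2)})^{\alpha_2}}{\alpha_1!\,\alpha_2!} R_{2,\alpha}(x)
\end{aligned}
$$

for any $1 \le l \le 6$ and $x \in \mathbb{R}$. Here, the ranges of $\alpha_1, \alpha_2$ in the first sum of $A_{n,l}$ satisfy $\alpha_1 + 2\alpha_2 = l$ and $1 \le |\alpha| \le 3$, while the ranges of $\alpha_1, \alpha_2, \tau$ in the second sum of $A_{n,l}$ satisfy $\alpha_1 + 2\alpha_2 + \tau = l$, $0 \le \tau \le 3 - |\alpha|$, and $1 \le |\alpha| \le 3$. According to the hypothesis $\delta_n^{(1)}/\delta_n^{(2)} \not\to 1$, we have

$$|\delta_n^{(2)}| \big/ |\delta_n^{(1)}-\delta_n^{(2)}| \not\to \infty.$$

Therefore, we have

$$\frac{|\delta_n^{(1)}-\delta_n^{(2)}|^{\alpha_1}|v_n^{(1)}-v_n^{(2)}|^{\alpha_2}\|R_{2,\alpha}(x)\|_\infty}{D_n} = \frac{O\big(|\delta_n^{(1)}-\delta_n^{(2)}|^{\alpha_1}|v_n^{(1)}-v_n^{(2)}|^{\alpha_2}|\delta_n^{(2)}|^{3-|\alpha|+\gamma}\big)}{D_n} \to 0.$$

As a consequence, we have $\|R(x)\|_\infty/D_n \to 0$ as $n \to \infty$.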


Step 2 - Non-vanishing coefficients and Fatou's argument. Assume that all the coefficients $A_{n,l}/D_n \to 0$ for all $1 \le l \le 6$ as $n \to \infty$. We denote the following key term

$$M_n = \max\big\{|\delta_n^{(1)}-\delta_n^{(2)}|,\ |v_n^{(1)}-v_n^{(2)}|^{1/2}\big\}.$$

As $|\delta_n^{(2)}|/|\delta_n^{(1)}-\delta_n^{(2)}| \not\to \infty$, we also have $|\delta_n^{(2)}|/M_n \not\to \infty$. Now, we denote $\delta_n^{(2)}/M_n \to x$, $(\delta_n^{(2)}-\delta_n^{(1)})/M_n \to y$, and $(v_n^{(1)}-v_n^{(2)})/M_n^{2} \to z$ as $n \to \infty$. From the definition of $M_n$, at least one among $y$ and $z$ is different from 0. By dividing both the numerator and the denominator of $A_{n,l}/D_n$ by $M_n^{l}$ for $1 \le l \le 3$, as $n \to \infty$, we obtain the following system of polynomial equations


$$cy^{2} + z - 2cxy = 0,$$
m(l—2m) 3,1 Co 1 
2(1— 7) 


2a tes 
5uvy = 0. 


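The above system of polynomial equations leads to $\pi(1-2\pi)\,y\,(y^{2} - 3xy + 3x^{2}) = 0$, which only holds when $y = 0$, since $y^{2} - 3xy + 3x^{2} = (y - 3x/2)^{2} + 3x^{2}/4$ vanishes only at $x = y = 0$. Therefore, it leads to $z = 0$, which is a contradiction. It implies that not all the coefficients $A_{n,l}/D_n \to 0$ as $n \to \infty$. Denote $m_n = D_n/\max_{1\le l\le 6}|A_{n,l}|$. According to the previous result, we have $m_n \not\to \infty$. Now, we have that

$$m_n\,\frac{g(x,\delta_n^{(1)},v_n^{(1)}) - g(x,\delta_n^{(2)},v_n^{(2)})}{D_n} \to \sum_{l=1}^{6} \tau_l\,\frac{\partial^{l} f}{\partial\theta^{l}}(x,0,v_0)$$

for some coefficients $\tau_l$ such that not all of them are 0. Similar to the proof of Theorem 2.1, by invoking Fatou's lemma with $V\big(g(x,\delta_n^{(1)},v_n^{(1)}),\, g(x,\delta_n^{(2)},v_n^{(2)})\big)/D_n \to 0$, the following equation holds

$$\sum_{l=1}^{6} \tau_l\,\frac{\partial^{l} f}{\partial\theta^{l}}(x,0,v_0) = 0$$

for almost all $x$. However, due to the linear independence of $\big\{\frac{\partial^{l} f}{\partial\theta^{l}}(x,0,v_0)\big\}_{l=1}^{6}$, we have $\tau_l = 0$ for all $1 \le l \le 6$, which is a contradiction. Therefore, Case a.1 does not hold.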
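The algebraic step in Step 2 of Case a.1, namely that $y(y^{2} - 3xy + 3x^{2}) = 0$ forces $y = 0$, can be checked with exact rational arithmetic. The sketch below is only an illustration under the assumption that the prefactor $\pi(1-2\pi)$ is nonzero; the helper names and the grid are hypothetical choices.

```python
from fractions import Fraction


def cubic(x, y):
    """y * (y^2 - 3xy + 3x^2), the Case a.1 cubic without its constant prefactor."""
    return y * (y * y - 3 * x * y + 3 * x * x)


def quadratic_factor(x, y):
    """The quadratic factor y^2 - 3xy + 3x^2."""
    return y * y - 3 * x * y + 3 * x * x


def completed_square(x, y):
    """(y - 3x/2)^2 + 3x^2/4, nonnegative and zero only at x = y = 0."""
    return (y - Fraction(3, 2) * x) ** 2 + Fraction(3, 4) * x * x


# Exact rational grid; the two quadratics agree at more points than their degree,
# so agreement on the grid certifies the polynomial identity.
grid = [Fraction(k, 4) for k in range(-8, 9)]

identity_holds = all(
    quadratic_factor(x, y) == completed_square(x, y) for x in grid for y in grid
)
# Every grid point where the cubic vanishes must have y = 0.
roots_force_y_zero = all(
    y == 0 for x in grid for y in grid if cubic(x, y) == 0
)
print(identity_holds, roots_force_y_zero)
```

With $y = 0$, the first equation of the system gives $z = 0$, which is the contradiction used in the proof.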


Case a.2: $\delta_n^{(1)}/\delta_n^{(2)} \to 1$ as $n \to \infty$. It implies that $|\delta_n^{(2)}|/|\delta_n^{(1)}-\delta_n^{(2)}| \to \infty$. Similar to Case a.2 in the proof of Theorem 2.1, the main challenge in this setting is that $R(x)/D_n$ does not converge to 0; therefore, we cannot rely on the argument of Case a.1 to obtain a contradiction. To deal with this problem, we will demonstrate two key properties under this setting: $\max_{1\le l\le 6}\{|A_{n,l}|\}/D_n \not\to 0$ and $\|R(x)\|_\infty/\max_{1\le l\le 6}|A_{n,l}| \to 0$. Indeed, we have the following possibilities regarding $\delta_n^{(1)}$, $\delta_n^{(2)}$, $v_n^{(1)}$, and $v_n^{(2)}$.
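Case a.2.1: $|v_n^{(1)}-v_n^{(2)}| \big/ \big\{|\delta_n^{(1)}-\delta_n^{(2)}||\delta_n^{(1)}+\delta_n^{(2)}|\big\} \to \infty$. Assume, to the contrary, that $\max_{1\le l\le 6}\{|A_{n,l}|\}/D_n \to 0$. From the formulation of $A_{n,2}$, we have

$$|A_{n,2}| = \frac{1}{2}\Big|(v_n^{(1)}-v_n^{(2)}) - c(\delta_n^{(2)}-\delta_n^{(1)})(\delta_n^{(2)}+\delta_n^{(1)})\Big| \gtrsim |v_n^{(1)}-v_n^{(2)}|$$

for $n$ sufficiently large, due to the assumption of Case a.2.1. Since $A_{n,2}/D_n \to 0$, it implies that $(v_n^{(1)}-v_n^{(2)})/D_n \to 0$. Therefore, it leads to $(\delta_n^{(1)}-\delta_n^{(2)})(\delta_n^{(2)}+\delta_n^{(1)})/D_n \to 0$. As $|\delta_n^{(2)}|/|\delta_n^{(1)}-\delta_n^{(2)}| \to \infty$, the previous limit implies that $|\delta_n^{(1)}-\delta_n^{(2)}|^{2}/D_n \to 0$. These results mean that

$$1 = \frac{|v_n^{(1)}-v_n^{(2)}|^{3/2} + |\delta_n^{(1)}-\delta_n^{(2)}|^{3}}{D_n} \to 0,$$

which is a contradiction. Therefore, we have $\max_{1\le l\le 6}\{|A_{n,l}|\}/D_n \not\to 0$. Now, for any $1 \le |\alpha| \le 3$, as $n$ is sufficiently large, we have

$$\frac{|\delta_n^{(1)}-\delta_n^{(2)}|^{\alpha_1}|v_n^{(1)}-v_n^{(2)}|^{\alpha_2}\|R_{2,\alpha}(x)\|_\infty}{\max_{1\le l\le 6}\{|A_{n,l}|\}} \lesssim \frac{O\big(|\delta_n^{(1)}-\delta_n^{(2)}|^{\alpha_1}|v_n^{(1)}-v_n^{(2)}|^{\alpha_2}|\delta_n^{(2)}|^{3-|\alpha|+\gamma}\big)}{|v_n^{(1)}-v_n^{(2)}|} \to 0.$$

Hence, we achieve that $\|R(x)\|_\infty/\max_{1\le l\le 6}\{|A_{n,l}|\} \to 0$ for all $x \in \mathbb{R}$.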
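Case a.2.2: $|v_n^{(1)}-v_n^{(2)}| \big/ \big\{|\delta_n^{(1)}-\delta_n^{(2)}||\delta_n^{(1)}+\delta_n^{(2)}|\big\} \to c' \neq c$. Under that assumption, we have

$$|A_{n,2}| = \frac{1}{2}\Big|(v_n^{(1)}-v_n^{(2)}) - c(\delta_n^{(2)}-\delta_n^{(1)})(\delta_n^{(2)}+\delta_n^{(1)})\Big| \gtrsim |\delta_n^{(1)}-\delta_n^{(2)}||\delta_n^{(1)}+\delta_n^{(2)}|$$

when $n$ is sufficiently large. If we have $\max_{1\le l\le 6}\{|A_{n,l}|\}/D_n \to 0$, then $|A_{n,2}|/D_n \to 0$ leads to both $(v_n^{(1)}-v_n^{(2)})/D_n \to 0$ and $(\delta_n^{(1)}-\delta_n^{(2)})(\delta_n^{(1)}+\delta_n^{(2)})/D_n \to 0$, which does not hold according to the argument of Case a.2.1. Therefore, $\max_{1\le l\le 6}\{|A_{n,l}|\}/D_n \not\to 0$. On the other hand, for any $1 \le |\alpha| \le 3$, as $n$ is sufficiently large, we have

$$
\begin{aligned}
\frac{|\delta_n^{(1)}-\delta_n^{(2)}|^{\alpha_1}|v_n^{(1)}-v_n^{(2)}|^{\alpha_2}\|R_{2,\alpha}(x)\|_\infty}{\max_{1\le l\le 6}\{|A_{n,l}|\}}
&\lesssim \frac{O\big(|\delta_n^{(1)}-\delta_n^{(2)}|^{\alpha_1}|v_n^{(1)}-v_n^{(2)}|^{\alpha_2}|\delta_n^{(2)}|^{3-|\alpha|+\gamma}\big)}{|\delta_n^{(1)}-\delta_n^{(2)}||\delta_n^{(1)}+\delta_n^{(2)}|} \\
&\lesssim \frac{O\big(|\delta_n^{(1)}-\delta_n^{(2)}|^{|\alpha|}|\delta_n^{(2)}|^{3-\alpha_1+\gamma}\big)}{|\delta_n^{(1)}-\delta_n^{(2)}||\delta_n^{(1)}+\delta_n^{(2)}|} \to 0.
\end{aligned}
$$

Hence, we achieve that $\|R(x)\|_\infty/\max_{1\le l\le 6}\{|A_{n,l}|\} \to 0$ for all $x \in \mathbb{R}$.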


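Case a.2.3: $|v_n^{(1)}-v_n^{(2)}| \big/ \big\{|\delta_n^{(1)}-\delta_n^{(2)}||\delta_n^{(1)}+\delta_n^{(2)}|\big\} \to c$. Without loss of generality, we assume that $(v_n^{(1)}-v_n^{(2)}) \big/ \big\{(\delta_n^{(1)}-\delta_n^{(2)})(\delta_n^{(1)}+\delta_n^{(2)})\big\} \to c$, as the argument when this ratio goes to $-c$ is similar. Under this assumption, we have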
|An,3| 
Jbn) — bn on? + dn Ibn 


|e (1 —a)ce(e+ 1)? 
2 4 


> 0. 


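Therefore, as $n$ is sufficiently large, we have $|A_{n,3}| \gtrsim |\delta_n^{(1)}-\delta_n^{(2)}||\delta_n^{(1)}+\delta_n^{(2)}||\delta_n^{(2)}|$. If we have $\max_{1\le l\le 6}\{|A_{n,l}|\}/D_n \to 0$, then $|A_{n,3}|/D_n \to 0$ leads to $|\delta_n^{(1)}-\delta_n^{(2)}||\delta_n^{(1)}+\delta_n^{(2)}||\delta_n^{(2)}|/D_n \to 0$. Therefore, the following holds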


uG — uf 9/2/4 16 — a8) + Pb + 0, 


which means jw) - vw?) / \5()/2 — oo — a contradiction to the assumption of Case a.2.3. Hence, 
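$\max_{1\le l\le 6}\{|A_{n,l}|\}/D_n \not\to 0$. On the other hand, for any $1 \le |\alpha| \le 3$, as $n$ is sufficiently large, we have

$$
\begin{aligned}
\frac{|\delta_n^{(1)}-\delta_n^{(2)}|^{\alpha_1}|v_n^{(1)}-v_n^{(2)}|^{\alpha_2}\|R_{2,\alpha}(x)\|_\infty}{\max_{1\le l\le 6}\{|A_{n,l}|\}}
&\lesssim \frac{O\big(|\delta_n^{(1)}-\delta_n^{(2)}|^{\alpha_1}|v_n^{(1)}-v_n^{(2)}|^{\alpha_2}|\delta_n^{(2)}|^{3-|\alpha|+\gamma}\big)}{|\delta_n^{(1)}-\delta_n^{(2)}||\delta_n^{(1)}+\delta_n^{(2)}||\delta_n^{(2)}|} \\
&\lesssim \frac{O\big(|\delta_n^{(1)}-\delta_n^{(2)}|^{|\alpha|}|\delta_n^{(2)}|^{3-\alpha_1+\gamma}\big)}{|\delta_n^{(1)}-\delta_n^{(2)}||\delta_n^{(1)}+\delta_n^{(2)}||\delta_n^{(2)}|} \to 0.
\end{aligned}
$$

Thus, we obtain that $\|R(x)\|_\infty/\max_{1\le l\le 6}\{|A_{n,l}|\} \to 0$ for all $x \in \mathbb{R}$.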


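Combining the results from Case a.2.1, Case a.2.2, and Case a.2.3, we finally achieve that $\max_{1\le l\le 6}\{|A_{n,l}|\}/D_n \not\to 0$ and $\|R(x)\|_\infty/\max_{1\le l\le 6}|A_{n,l}| \to 0$. Denote $m_n' = D_n/\max_{1\le l\le 6}\{|A_{n,l}|\}$. Then, we have $m_n' \not\to \infty$. Thus, the following limit holds

$$m_n'\,\frac{g(x,\delta_n^{(1)},v_n^{(1)}) - g(x,\delta_n^{(2)},v_n^{(2)})}{D_n} \to \sum_{l=1}^{6} \tau_l\,\frac{\partial^{l} f}{\partial\theta^{l}}(x,0,v_0)$$

for some coefficients $\tau_l$ such that not all of them are 0. By means of Fatou's lemma with the ratio $V\big(g(x,\delta_n^{(1)},v_n^{(1)}),\, g(x,\delta_n^{(2)},v_n^{(2)})\big)/D_n \to 0$, we obtain that

$$\sum_{l=1}^{6} \tau_l\,\frac{\partial^{l} f}{\partial\theta^{l}}(x,0,v_0) = 0.$$

However, due to the linear independence of $\big\{\frac{\partial^{l} f}{\partial\theta^{l}}(x,0,v_0)\big\}_{l=1}^{6}$, we have $\tau_l = 0$ for all $1 \le l \le 6$, which is a contradiction. Therefore, Case a.2 does not hold. As a consequence, we achieve the conclusion with the upper bound of part (a) of the theorem.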
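(b) Similar to the proof of part (a), it is sufficient to demonstrate that

$$\inf_{\substack{\delta^{(1)},\delta^{(2)}\in\Theta \\ v^{(1)},v^{(2)}\in\Omega}} \frac{V\big(g(x,\delta^{(1)},v^{(1)}),\, g(x,\delta^{(2)},v^{(2)})\big)}{\big|\,|\delta^{(1)}| - |\delta^{(2)}|\,\big|^{4} + |v^{(1)}-v^{(2)}|^{2}} > 0,$$

where $\Theta = [-1,1]$ and $\Omega$ is a bounded set containing $\sigma^2$. Assume that the above inequality does not hold. Then we can find sequences $\{\delta_n^{(1)}\}$, $\{\delta_n^{(2)}\}$, $\{v_n^{(1)}\}$, and $\{v_n^{(2)}\}$ such that

$$\frac{V\big(g(x,\delta_n^{(1)},v_n^{(1)}),\, g(x,\delta_n^{(2)},v_n^{(2)})\big)}{\big|\,|\delta_n^{(1)}| - |\delta_n^{(2)}|\,\big|^{4} + |v_n^{(1)}-v_n^{(2)}|^{2}} \to 0$$

as $n \to \infty$. Similar to the proof of part (a), we only consider the most challenging setting $\delta_n^{(1)} \to 0$, $\delta_n^{(2)} \to 0$, $v_n^{(1)} \to v_0$, $v_n^{(2)} \to v_0$ for some $v_0 \in \Omega$. For the convenience of presentation, we denote

$$D_n = \big|\,|\delta_n^{(1)}| - |\delta_n^{(2)}|\,\big|^{4} + |v_n^{(1)}-v_n^{(2)}|^{2}.$$

Now, we have three settings with $\delta_n^{(1)}$ and $\delta_n^{(2)}$ in the proof of part (b).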


Case b.1: $\delta_n^{(1)}/\delta_n^{(2)} \not\to 1$ as $n \to \infty$ and $\delta_n^{(1)}\delta_n^{(2)} \ge 0$ for all $n$. Under this case, we have

$$D_n = |\delta_n^{(1)}-\delta_n^{(2)}|^{4} + |v_n^{(1)}-v_n^{(2)}|^{2}.$$

We again divide the argument for this case into two key steps.


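Step 1 - Taylor expansion. Using an argument similar to that of part (a), by means of a Taylor expansion up to the fourth order, we get the following representation

$$\frac{g(x,\delta_n^{(1)},v_n^{(1)}) - g(x,\delta_n^{(2)},v_n^{(2)})}{D_n} = \frac{\sum_{l=1}^{8} B_{n,l}\,\frac{\partial^{l} f}{\partial\theta^{l}}(x,-\delta_n^{(2)},v_n^{(2)}) + R(x)}{D_n},$$

where the formulations of $B_{n,l}$ and $R(x)$ are as follows

$$
\begin{aligned}
B_{n,l} &= \frac{1}{2}\sum_{\alpha_1,\alpha_2} \frac{1}{2^{\alpha_2}} \frac{(\delta_n^{(2)}-\delta_n^{(1)})^{\alpha_1}(v_n^{(1)}-v_n^{(2)})^{\alpha_2}}{\alpha_1!\,\alpha_2!}
+ \frac{1}{2}\sum_{\alpha_1,\alpha_2,\tau} \frac{1}{2^{\alpha_2}} \frac{2^{\tau}(\delta_n^{(2)})^{\tau}(\delta_n^{(1)}-\delta_n^{(2)})^{\alpha_1}(v_n^{(1)}-v_n^{(2)})^{\alpha_2}}{\tau!\,\alpha_1!\,\alpha_2!}, \\
R(x) &= \frac{1}{2}R_1(x) + \frac{1}{2}R_2(x) + \frac{1}{2}\sum_{1\le|\alpha|\le 4} \frac{1}{2^{\alpha_2}} \frac{(\delta_n^{(1)}-\delta_n^{(2)})^{\alpha_1}(v_n^{(1)}-v_n^{(2)})^{\alpha_2}}{\alpha_1!\,\alpha_2!} R_{2,\alpha}(x).
\end{aligned}
$$

Here, the ranges of $\alpha_1, \alpha_2$ in the first sum of $B_{n,l}$ satisfy $\alpha_1 + 2\alpha_2 = l$ and $1 \le |\alpha| \le 4$, while the ranges of $\alpha_1, \alpha_2, \tau$ in the second sum of $B_{n,l}$ satisfy $\alpha_1 + 2\alpha_2 + \tau = l$, $0 \le \tau \le 4 - |\alpha|$, and $1 \le |\alpha| \le 4$. Additionally, $R_1(x)$ is the Taylor remainder from expanding $f(x,-\delta_n^{(1)},v_n^{(1)})$ around $f(x,-\delta_n^{(2)},v_n^{(2)})$ up to the fourth order, $R_2(x)$ is the Taylor remainder from expanding $f(x,c\delta_n^{(1)},v_n^{(1)})$ around $f(x,c\delta_n^{(2)},v_n^{(2)})$ up to the fourth order, and $R_{2,\alpha}(x)$ is the Taylor remainder from expanding $\frac{\partial^{\alpha_1+2\alpha_2} f}{\partial\theta^{\alpha_1+2\alpha_2}}(x,c\delta_n^{(2)},v_n^{(2)})$ around $\frac{\partial^{\alpha_1+2\alpha_2} f}{\partial\theta^{\alpha_1+2\alpha_2}}(x,-\delta_n^{(2)},v_n^{(2)})$ up to the order $4 - |\alpha|$. Similar to the argument of Case a.1, the assumption of Case b.1 is sufficient to guarantee that $R(x)/D_n \to 0$.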


Step 2 - Non-vanishing coefficients and Fatou's argument. Assume that all the coefficients $B_{n,l}/D_n \to 0$ for all $1 \le l \le 8$ as $n \to \infty$. Recall from part (a) that we denote

$$M_n = \max\big\{|\delta_n^{(1)}-\delta_n^{(2)}|,\ |v_n^{(1)}-v_n^{(2)}|^{1/2}\big\}.$$

Additionally, we also denote $\delta_n^{(2)}/M_n \to x$, $(\delta_n^{(2)}-\delta_n^{(1)})/M_n \to y$, and $(v_n^{(1)}-v_n^{(2)})/M_n^{2} \to z$ as $n \to \infty$, where at least one of $y$ and $z$ is different from 0. Due to the assumption that $\delta_n^{(1)}\delta_n^{(2)} \ge 0$, we have $x(x-y) \ge 0$. Now, by dividing both the numerator and the denominator of $B_{n,l}/D_n$ by $M_n^{l}$ for $1 \le l \le 4$, as $n \to \infty$, we obtain the following system of polynomial equations

$$y^{2} + z - 2xy = 0,$$
4 yz a Zz? LYZ ce gz ry? xy? Qa>y _ 
4!’ 4 8 2 2 6°" 2 a= 


0. 


When $x = 0$, the above system of polynomial equations leads to $y = z = 0$, which contradicts the assumption that at least one of $y$, $z$ is different from 0. When $x \neq 0$, the above system of polynomial equations leads to $y^{3} - 4xy^{2} + 6x^{2}y - 4x^{3} = 0$, which leads to $y = 2x$, a contradiction to the condition $x(x-y) \ge 0$ and $x \neq 0$. Therefore, not all of the coefficients $B_{n,l}/D_n \to 0$ as $n \to \infty$. From here, using the same argument as that of Case a.1 in part (a), we achieve the conclusion that Case b.1 cannot hold.
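The step from the cubic equation to $y = 2x$ in Case b.1 follows from the factorization $y^{3} - 4xy^{2} + 6x^{2}y - 4x^{3} = (y - 2x)\big((y-x)^{2} + x^{2}\big)$, whose second factor is strictly positive when $x \neq 0$. The sketch below checks this with exact rational arithmetic; the function names and the grid are illustrative assumptions rather than part of the original proof.

```python
from fractions import Fraction


def cubic(x, y):
    """The Case b.1 cubic y^3 - 4xy^2 + 6x^2 y - 4x^3."""
    return y ** 3 - 4 * x * y ** 2 + 6 * x ** 2 * y - 4 * x ** 3


def factored(x, y):
    """Candidate factorization (y - 2x) * ((y - x)^2 + x^2)."""
    return (y - 2 * x) * ((y - x) ** 2 + x ** 2)


# Exact rational grid with more points per variable than the polynomial degree.
grid = [Fraction(k, 4) for k in range(-8, 9)]

factorization_holds = all(cubic(x, y) == factored(x, y) for x in grid for y in grid)
# For x != 0 the only real root in y is y = 2x.
roots_are_2x = all(
    y == 2 * x for x in grid for y in grid if x != 0 and cubic(x, y) == 0
)
# At y = 2x the sign condition x(x - y) >= 0 fails, since x(x - 2x) = -x^2 < 0.
sign_condition_fails = all(x * (x - 2 * x) < 0 for x in grid if x != 0)
print(factorization_holds, roots_are_2x, sign_condition_fails)
```

The same factorization, with $(x, y)$ replaced by $(x_1, y_1)$, gives $y_1 = 2x_1$ in Case b.2.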


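Case b.2: $\delta_n^{(1)}/\delta_n^{(2)} \not\to 1$ as $n \to \infty$ and $\delta_n^{(1)}\delta_n^{(2)} \le 0$ for all $n$. Under this case, we have

$$D_n = |\delta_n^{(1)}+\delta_n^{(2)}|^{4} + |v_n^{(1)}-v_n^{(2)}|^{2}.$$

By means of a Taylor expansion up to the fourth order, we obtain the following representation

$$
\begin{aligned}
&\frac{g(x,\delta_n^{(1)},v_n^{(1)}) - g(x,\delta_n^{(2)},v_n^{(2)})}{D_n} \\
&= \frac{\frac{1}{2}\big(f(x,-\delta_n^{(1)},v_n^{(1)}) - f(x,\delta_n^{(2)},v_n^{(2)})\big) + \frac{1}{2}\big(f(x,\delta_n^{(1)},v_n^{(1)}) - f(x,-\delta_n^{(2)},v_n^{(2)})\big)}{D_n} \\
&= \frac{\sum_{l=1}^{8} C_{n,l}\,\frac{\partial^{l} f}{\partial\theta^{l}}(x,\delta_n^{(2)},v_n^{(2)}) + R(x)}{D_n},
\end{aligned}
$$

where the formulations of $C_{n,l}$ and $R(x)$ are as follows

$$
\begin{aligned}
C_{n,l} &= \frac{1}{2}\sum_{\alpha_1,\alpha_2} \frac{1}{2^{\alpha_2}} \frac{(-\delta_n^{(2)}-\delta_n^{(1)})^{\alpha_1}(v_n^{(1)}-v_n^{(2)})^{\alpha_2}}{\alpha_1!\,\alpha_2!}
+ \frac{1}{2}\sum_{\alpha_1,\alpha_2,\tau} \frac{1}{2^{\alpha_2}} \frac{(-2)^{\tau}(\delta_n^{(2)})^{\tau}(\delta_n^{(1)}+\delta_n^{(2)})^{\alpha_1}(v_n^{(1)}-v_n^{(2)})^{\alpha_2}}{\tau!\,\alpha_1!\,\alpha_2!}, \\
R(x) &= \frac{1}{2}R_1(x) + \frac{1}{2}R_2(x) + \frac{1}{2}\sum_{1\le|\alpha|\le 4} \frac{1}{2^{\alpha_2}} \frac{(\delta_n^{(1)}+\delta_n^{(2)})^{\alpha_1}(v_n^{(1)}-v_n^{(2)})^{\alpha_2}}{\alpha_1!\,\alpha_2!} R_{2,\alpha}(x).
\end{aligned}
$$

Here, the ranges of $\alpha_1, \alpha_2$ in the first sum of $C_{n,l}$ satisfy $\alpha_1 + 2\alpha_2 = l$ and $1 \le |\alpha| \le 4$, while the ranges of $\alpha_1, \alpha_2, \tau$ in the second sum of $C_{n,l}$ satisfy $\alpha_1 + 2\alpha_2 + \tau = l$, $0 \le \tau \le 4 - |\alpha|$, and $1 \le |\alpha| \le 4$. Additionally, $R_1(x)$ is the Taylor remainder from expanding $f(x,-\delta_n^{(1)},v_n^{(1)})$ around $f(x,\delta_n^{(2)},v_n^{(2)})$ up to the fourth order, $R_2(x)$ is the Taylor remainder from expanding $f(x,\delta_n^{(1)},v_n^{(1)})$ around $f(x,-\delta_n^{(2)},v_n^{(2)})$ up to the fourth order, and $R_{2,\alpha}(x)$ is the Taylor remainder from expanding $\frac{\partial^{\alpha_1+2\alpha_2} f}{\partial\theta^{\alpha_1+2\alpha_2}}(x,-\delta_n^{(2)},v_n^{(2)})$ around $\frac{\partial^{\alpha_1+2\alpha_2} f}{\partial\theta^{\alpha_1+2\alpha_2}}(x,\delta_n^{(2)},v_n^{(2)})$ up to the order $4 - |\alpha|$. Due to the assumption of Case b.2, we can check that $\|R(x)\|_\infty/D_n \to 0$ as $n \to \infty$.

Assume that all the coefficients $C_{n,l}/D_n \to 0$ for all $1 \le l \le 8$ as $n \to \infty$. We denote

$$M_n = \max\big\{|\delta_n^{(1)}+\delta_n^{(2)}|,\ |v_n^{(1)}-v_n^{(2)}|^{1/2}\big\}.$$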


From the definition of $M_n$, we can denote $\delta_n^{(2)}/M_n \to x_1$, $(\delta_n^{(2)}+\delta_n^{(1)})/M_n \to y_1$, and $(v_n^{(1)}-v_n^{(2)})/M_n^{2} \to z_1$ as $n \to \infty$, where at least one of $y_1$ and $z_1$ is different from 0. Due to the assumption that $\delta_n^{(1)}\delta_n^{(2)} \le 0$, we have $x_1(y_1 - x_1) \le 0$. Now, by dividing both the numerator and the denominator of $C_{n,l}/D_n$ by $M_n^{l}$ for $1 \le l \le 4$, as $n \to \infty$, we obtain the following system of polynomial equations

$$y_1^{2} + z_1 - 2x_1y_1 = 0,$$


Yt ¥it. | t Miyiza 4 ti viye tiv, 2aty _¢ 
Ar’ 4 | 8 5 2 6 2 3 


If $x_1 = 0$, the above system leads to $y_1 = z_1 = 0$, which contradicts the assumption on $y_1, z_1$. If $x_1 \neq 0$, the above system of polynomial equations leads to $y_1 = 2x_1$, a contradiction to the condition $x_1(y_1 - x_1) \le 0$ and $x_1 \neq 0$. Therefore, not all of the coefficients $C_{n,l}/D_n \to 0$ as $n \to \infty$. From here, using the same argument as that of Case a.1 in part (a), we achieve the conclusion that Case b.2 cannot hold.

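Case b.3: $\delta_n^{(1)}/\delta_n^{(2)} \to 1$ as $n \to \infty$. Under this assumption, we have $\delta_n^{(1)}\delta_n^{(2)} > 0$ as $n$ is sufficiently large. Without loss of generality, we assume that $\delta_n^{(1)}\delta_n^{(2)} > 0$ for all $n$. Therefore, we have

$$D_n = |\delta_n^{(1)}-\delta_n^{(2)}|^{4} + |v_n^{(1)}-v_n^{(2)}|^{2}.$$

Recall from Case b.1 that we have the following representation

$$\frac{g(x,\delta_n^{(1)},v_n^{(1)}) - g(x,\delta_n^{(2)},v_n^{(2)})}{D_n} = \frac{\sum_{l=1}^{8} B_{n,l}\,\frac{\partial^{l} f}{\partial\theta^{l}}(x,-\delta_n^{(2)},v_n^{(2)}) + R(x)}{D_n}.$$

The main challenge in Case b.3 is that $\|R(x)\|_\infty/D_n \not\to 0$ as $n \to \infty$. To avoid this issue, we will utilize the technique in Case a.2 of the proof of Theorem 2.2. In particular, we will demonstrate two key properties: $\|R(x)\|_\infty/\max_{1\le l\le 8}|B_{n,l}| \to 0$ and $\max_{1\le l\le 8}|B_{n,l}|/D_n \not\to 0$ as $n \to \infty$.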


Under the settings of Case a.2.1 and Case a.2.2 in the proof of part (a), with the same argument as in those cases, we have $|B_{n,2}|/D_n \not\to 0$ and $\|R(x)\|_\infty/|B_{n,2}| \to 0$. Therefore, we have $\|R(x)\|_\infty/\max_{1\le l\le 8}|B_{n,l}| \to 0$ and $\max_{1\le l\le 8}|B_{n,l}|/D_n \not\to 0$ under the settings of Case a.2.1 and Case a.2.2. It implies that we only need to focus on the setting that

$$|v_n^{(1)}-v_n^{(2)}| \big/ \big\{|\delta_n^{(1)}-\delta_n^{(2)}||\delta_n^{(1)}+\delta_n^{(2)}|\big\} \to 1.$$

Without loss of generality, we assume that $(v_n^{(1)}-v_n^{(2)}) \big/ \big\{(\delta_n^{(1)}-\delta_n^{(2)})(\delta_n^{(1)}+\delta_n^{(2)})\big\} \to 1$, as the argument for the setting that this ratio goes to $-1$ is similar. Under this setting, we can easily check that


[Bnal/4 100? - ase} =o) 




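Therefore, as $n$ is sufficiently large, we have

$$|B_{n,4}| \gtrsim |\delta_n^{(1)}-\delta_n^{(2)}||\delta_n^{(2)}|^{3}.$$

If we have $\max_{1\le l\le 8}|B_{n,l}|/D_n \to 0$, then $|B_{n,4}|/D_n \to 0$ leads to $|\delta_n^{(1)}-\delta_n^{(2)}||\delta_n^{(2)}|^{3}/D_n \to 0$. Therefore, the following holds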


(1) — (2))2 7) |g) _ §(2)])5(2))3 
ler — Un 9 bn? — On On” ¢ > 00; 


which means jo) — vy?) / 50) /2 — oo, which is a contradiction to the assumption that (wv) — 


ve?) / (5) — 52) (60 + ay} — 1. Thus, we have max |Bni|/Dn # 0. On the other hand, as 


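$n$ is sufficiently large, we have

$$
\begin{aligned}
\frac{|\delta_n^{(1)}-\delta_n^{(2)}|^{\alpha_1}|v_n^{(1)}-v_n^{(2)}|^{\alpha_2}\|R_{2,\alpha}(x)\|_\infty}{\max_{1\le l\le 8}\{|B_{n,l}|\}}
&\lesssim \frac{O\big(|\delta_n^{(1)}-\delta_n^{(2)}|^{\alpha_1}|v_n^{(1)}-v_n^{(2)}|^{\alpha_2}|\delta_n^{(2)}|^{4-|\alpha|+\gamma}\big)}{|\delta_n^{(1)}-\delta_n^{(2)}||\delta_n^{(2)}|^{3}} \\
&\lesssim \frac{O\big(|\delta_n^{(1)}-\delta_n^{(2)}|^{|\alpha|}|\delta_n^{(2)}|^{4-\alpha_1+\gamma}\big)}{|\delta_n^{(1)}-\delta_n^{(2)}||\delta_n^{(2)}|^{3}} \to 0.
\end{aligned}
$$

It implies that $\|R(x)\|_\infty/\max_{1\le l\le 8}\{|B_{n,l}|\} \to 0$. From here, using the same argument as that of Case a.2.3, we obtain the contradiction, which leads to the conclusion that Case b.3 cannot hold. As a consequence, we achieve the conclusion of part (b) of the theorem.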
G.3 Proof of extra results 


In this appendix, we provide the proof of an additional result on the non-polynomial convergence rate of the MLE $\hat{\delta}_n^{\mathrm{MLE}}$ under the known variances setting (2.1).


Proposition 4. Under the symmetric regime of the true model (2.1), we have 


bn€O 


where $\Theta = [-1,1]$. Here, $E_{\delta_n}$ denotes the expectation taken with respect to the product measure with mixture density of $Y_1,\ldots,Y_n$ under the model (2.1).


Proof. We divide the proof of this result into two key parts.

Part 1 - Upper bound of Hellinger distance between mixing densities in terms of their corresponding parameters. To obtain this bound, we first prove the following key result

$$\inf_{\delta^{(1)},\delta^{(2)}\in\Theta} h\big(g(x,\delta^{(1)}),\, g(x,\delta^{(2)})\big) \big/ |\delta^{(1)}-\delta^{(2)}|^{r} = 0 \tag{G.14}$$

for any $r \ge 1$. In fact, we construct two sequences $\{\delta_n^{(1)}\}$ and $\{\delta_n^{(2)}\}$ such that $\delta_n^{(1)} = -\delta_n^{(2)}$ for all $n \ge 1$. Then, it is clear that $h\big(g(x,\delta_n^{(1)}),\, g(x,\delta_n^{(2)})\big) = 0$ for all $n \ge 1$. Therefore, it is straightforward that $h\big(g(x,\delta_n^{(1)}),\, g(x,\delta_n^{(2)})\big) \le |\delta_n^{(1)}-\delta_n^{(2)}|^{r}$ for any $r \ge 1$. As a consequence, we achieve the conclusion of (G.14).
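The construction in Part 1 can be illustrated numerically. The sketch below assumes that the symmetric regime corresponds to $\pi = 1/2$ and $c = 1$ with unit component variance, and uses the convention $h^{2} = \frac{1}{2}\int(\sqrt{g_1} - \sqrt{g_2})^{2}\,dx$; both are assumptions made for illustration, and the function names and integration grid are hypothetical.

```python
import math


def normal_pdf(x, mean, var=1.0):
    """Density of N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def symmetric_mixture(x, delta):
    """Assumed symmetric-regime density: 0.5 * N(-delta, 1) + 0.5 * N(delta, 1)."""
    return 0.5 * normal_pdf(x, -delta) + 0.5 * normal_pdf(x, delta)


def hellinger(delta_a, delta_b, lo=-12.0, hi=12.0, steps=4000):
    """Trapezoid-rule estimate of the Hellinger distance between two mixtures."""
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * width
        diff = (math.sqrt(symmetric_mixture(x, delta_a))
                - math.sqrt(symmetric_mixture(x, delta_b)))
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * diff * diff
    return math.sqrt(0.5 * total * width)


# delta and -delta give the same density, so the distance vanishes
# even though the parameters are 2 * delta apart.
same_density = hellinger(0.3, -0.3)
different_density = hellinger(0.3, 0.1)
print(same_density, different_density)
```

Because the density depends on $\delta$ only through $|\delta|$ under this assumption, the ratio in (G.14) is zero along the sequence $\delta_n^{(1)} = -\delta_n^{(2)}$ for every exponent $r$.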


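Part 2 - Le Cam's argument for minimax lower bound. Now, we follow the traditional Le Cam argument for minimax lower bounds (Yu, 1997) to obtain the non-polynomial convergence rate of $\hat{\delta}_n^{\mathrm{MLE}}$ to $\delta_n$. In particular, due to the result from (G.14), for any $\epsilon_n > 0$ sufficiently small and any fixed $r \ge 1$, we can find $\delta^{(1)}$ and $\delta^{(2)}$ such that $|\delta^{(1)}-\delta^{(2)}| = 2\epsilon_n$ and $h\big(g(x,\delta^{(1)}),\, g(x,\delta^{(2)})\big) \le C\epsilon_n^{r}$, where $C$ is a fixed positive constant. Invoking Lemma 1 from Yu (1997), the following inequality holds

$$\sup_{\delta_n\in\Theta} E_{\delta_n}\big|\hat{\delta}_n^{\mathrm{MLE}} - \delta_n\big| \ge \sup_{\delta_n\in\{\delta^{(1)},\delta^{(2)}\}} E_{\delta_n}\big|\hat{\delta}_n^{\mathrm{MLE}} - \delta_n\big| \gtrsim \epsilon_n\Big[1 - V\big(g^{n}(x,\delta^{(1)}),\, g^{n}(x,\delta^{(2)})\big)\Big], \tag{G.15}$$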


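where $g^{n}(x,\delta_n)$ denotes the density of $n$ i.i.d. samples $Y_1,\ldots,Y_n$. By means of the classical inequality between total variation distance and Hellinger distance, $V \le h$, we obtain that

$$V\big(g^{n}(x,\delta^{(1)}),\, g^{n}(x,\delta^{(2)})\big) \le h\big(g^{n}(x,\delta^{(1)}),\, g^{n}(x,\delta^{(2)})\big) \le \sqrt{1 - \big(1 - C^{2}\epsilon_n^{2r}\big)^{n}}.$$

By choosing $C^{2}\epsilon_n^{2r} = 1/n$, it is clear that

$$\epsilon_n\Big[1 - V\big(g^{n}(x,\delta^{(1)}),\, g^{n}(x,\delta^{(2)})\big)\Big] \gtrsim \epsilon_n \asymp n^{-1/(2r)}. \tag{G.16}$$

Combining the results from (G.15) and (G.16), we achieve the conclusion that

$$\sup_{\delta_n\in\Theta} E_{\delta_n}\big|\hat{\delta}_n^{\mathrm{MLE}} - \delta_n\big| \gtrsim n^{-1/(2r)}$$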
bn€O 


for any r > 2. 
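To make the choice of $\epsilon_n$ in Part 2 concrete, the sketch below solves $C^{2}\epsilon_n^{2r} = 1/n$ and evaluates the resulting two-point bound up to constants. It assumes the product-measure bound $V \le \sqrt{1 - (1 - 1/n)^{n}}$ and drops the unspecified constant from Lemma 1 of Yu (1997); the function names and the default $C = 1$ are illustrative assumptions.

```python
import math


def le_cam_epsilon(n, r, C=1.0):
    """Separation epsilon_n solving C^2 * epsilon^(2r) = 1/n."""
    return (1.0 / (C * C * n)) ** (1.0 / (2.0 * r))


def tv_upper_bound(n):
    """sqrt(1 - (1 - 1/n)^n), the assumed bound on V between the n-fold products."""
    return math.sqrt(1.0 - (1.0 - 1.0 / n) ** n)


def two_point_rate(n, r, C=1.0):
    """epsilon_n * (1 - V bound): the lower bound up to a universal constant."""
    return le_cam_epsilon(n, r, C) * (1.0 - tv_upper_bound(n))


# The factor 1 - V stays bounded away from 0, so the bound scales like n^(-1/(2r)).
for r in (1, 2, 5):
    for n in (10, 1000, 10 ** 6):
        scaled = two_point_rate(n, r) * n ** (1.0 / (2.0 * r))
        print(r, n, round(scaled, 4))
```

Since $(1 - 1/n)^{n} \to e^{-1}$, the factor $1 - V$ is bounded below by a positive constant, and the exponent $1/(2r)$ can be made arbitrarily small by taking $r$ large, which is the sense in which the rate is non-polynomial.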